r - Clustering - how to find the nearest cluster

Question

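A hint on another question has left me confused.

I was working on an exercise that is actually part of a larger one:

1. Cluster some data using hclust (done).
2. Given a brand-new vector, find out which of the clusters from step 1 it is closest to.

According to the exercise, this should be doable in very little time.

However, weeks later I am puzzled about whether this can be done at all, because apparently all I really get from hclust is a tree, and not, as I had assumed, a set of clusters.

Since I think I was unclear:

For example, I feed hclust a matrix consisting of 15 1x5 vectors: 5 times (1 1 1 1 1), 5 times (2 2 2 2 2), and 5 times (3 3 3 3 3). This should give me three clearly distinct clusters of size 5, which anyone could easily find by hand. Is there a command I can use to find out programmatically that my hclust object contains 3 such clusters, and what they contain?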

score 1 · Accepted Answer

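You have to think about what the right metric is for defining closeness to a cluster. Building on the example in the hclust documentation, here is one way to compute the mean of each cluster and then measure the distance between the new data point and that set of means.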

# Leave out one state
A <-USArrests
B <-A[rownames(A)!="Kentucky",]
KY <- A[rownames(A)=="Kentucky",]

# Put the B data into 10 clusters
hc   <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)
B$cluster <- memb[rownames(B)]  # look up each state's cluster label by name

# Compute the averages over the clusters
M <-aggregate( .~cluster, data=B, FUN=mean)
M$cluster=NULL

# Now add the hold out state to the set of averages
M <-rbind(M,KY)

# Compute the distance between the clusters and the hold out state.
# This is a pretty silly way to do this but it works.
D <- as.matrix(dist(as.matrix(M),diag=TRUE,upper=TRUE))["Kentucky",]
names(D) = rownames(M)
KYclust  = which.min(D[-length(D)])
memb[memb==KYclust]

# Now cluster the full set of states and compare the results.  
hc   <- hclust(dist(A), "ave")
memb <- cutree(hc, k = 10)
a=memb[which(names(memb)=="Kentucky")]
memb[memb==a]
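
For the toy example in the question, the same cutree call is what turns the tree into actual clusters. A minimal sketch, assuming the default complete linkage and Euclidean distance; the names X, hc, memb, and members are only illustrative:

```r
# 15 row vectors of length 5: five rows of 1s, five of 2s, five of 3s.
# rep(1:3, each = 5) fills one column and is recycled across all 5 columns.
X <- matrix(rep(1:3, each = 5), nrow = 15, ncol = 5)

hc   <- hclust(dist(X))     # the tree
memb <- cutree(hc, k = 3)   # cut the tree into 3 clusters: one label per row

table(memb)                 # how many rows ended up in each cluster

# Row indices of the members of each cluster
members <- split(seq_len(nrow(X)), memb)
members
```

If the number of clusters is not known in advance, cutree also accepts a height via h instead of k.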

score 1 · Accepted Answer

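Unlike k-means, the clusters found by hclust can have arbitrary shapes.

So the distance to the nearest cluster center is not always meaningful.

A 1-nearest-neighbor style assignment may work better.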
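
A minimal sketch of that idea, holding out Kentucky from USArrests and assigning it to the cluster of its single nearest neighbor. Euclidean distance is assumed here to match the default of dist, and the names d, nearest, and KYclust_nn are only illustrative:

```r
# Hold out one state and cluster the rest
A  <- USArrests
B  <- A[rownames(A) != "Kentucky", ]
KY <- A[rownames(A) == "Kentucky", ]

hc   <- hclust(dist(B), "ave")
memb <- cutree(hc, k = 10)

# Euclidean distance from the held-out point to every clustered point:
# subtract KY from each row of B, square, sum per row, take the root
d <- sqrt(rowSums(sweep(as.matrix(B), 2, unlist(KY))^2))

nearest    <- names(which.min(d))   # the closest state
KYclust_nn <- memb[[nearest]]       # the cluster that state belongs to

memb[memb == KYclust_nn]            # members of the assigned cluster
```

Using the k nearest points and taking a majority vote over their clusters would be a natural variation if a single neighbor turns out to be too noisy.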
