r - Hierarchical clustering: determining the optimal number of clusters and statistically describing the clusters

Question

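I could use some advice on methods in R to determine the optimal number of clusters and, afterwards, to describe the clusters with different statistical criteria. I am new to R and have a basic knowledge of the statistical foundations of cluster analysis.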

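Methods to determine the number of clusters: In the literature, one common method is the so-called "elbow criterion", which compares the sum of squared differences (SSD) for different cluster solutions. The SSD is plotted against the number of clusters in the analysis, and the optimal number of clusters is determined by identifying the "elbow" in the plot (e.g. here: https://en.wikipedia.org/wiki/File:DataClustering_ElbowCriterion.JPG). This method is a first approach to get a subjective impression, so I would like to implement it in R. Information on the internet about this is sparse. There is one good example here: http://www.mattpeeples.net/kmeans.html, where the author also used an interesting iterative approach to see whether the elbow is stable after several repetitions of the clustering process (although it is for partitioning cluster methods, not for hierarchical ones).

Other methods in the literature comprise the so-called "stopping rules". Milligan & Cooper compared 30 of these stopping rules in their paper "An examination of procedures for determining the number of clusters in a data set" (available here: http://link.springer.com/article/10.1007%2FBF02294245) and found that the stopping rule of Calinski and Harabasz provided the best results in a Monte Carlo evaluation. Information on implementing this in R is even sparser. So if anyone has ever implemented this or another stopping rule (or another method), some advice would be very helpful.

Statistically describing the clusters: To describe the clusters I thought of using the mean and some sort of variance criterion. My data is on agricultural land use and shows the production numbers of different crops per municipality. My aim is to find similar patterns of land use in my dataset.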

I wrote a script for a subset of objects to do a first test run. It looks like this (explanations of the steps are in the script, sources below).

    # Cluster analysis: agriculture

    # Load data (no attach() needed; columns are accessed through the data frame)
    agriculture <- read.table("C:\\Users\\etc...", header = TRUE, sep = ";")

    # Define data frame to work with
    df <- data.frame(agriculture)

    # Define a subset of objects (the first 11 rows) to first test the script
    aTOk <- df[1:11, ]

    # Calculate Euclidean distances including only the columns 4 to 24
    dist.euklid <- dist(aTOk[, 4:24], method = "euclidean", diag = TRUE, upper = FALSE)
    print(dist.euklid)

    # Cluster with Ward. Since R 3.1.0 the former method = "ward" is named "ward.D";
    # "ward.D2" is the variant that applies Ward's criterion to unsquared distances.
    cluster.ward <- hclust(dist.euklid, method = "ward.D")

    # Plot the dendrogram. labels = df$Geocode didn't work, likely because df has
    # more rows than the subset; the labels have to come from aTOk instead.
    plot(cluster.ward, labels = as.character(aTOk$Geocode), hang = -0.01, cex = 0.7)

    # here are missing methods to determine the optimal number of clusters

    # Calculate different solutions with different numbers of clusters
    n.cluster <- sapply(2:5, function(n.cluster) table(cutree(cluster.ward, n.cluster)))
    n.cluster

    # Show the objects within clusters for the three-cluster solution
    three.cluster <- cutree(cluster.ward, 3)
    sapply(unique(three.cluster), function(g) aTOk$Geocode[three.cluster == g])

    # Calculate some statistics to describe the clusters
    three.cluster.median <- aggregate(aTOk[, 4:24], list(three.cluster), median)
    three.cluster.median
    three.cluster.min <- aggregate(aTOk[, 4:24], list(three.cluster), min)
    three.cluster.min
    three.cluster.max <- aggregate(aTOk[, 4:24], list(three.cluster), max)
    three.cluster.max

    # Summary statistics for one variable
    three.cluster.summary <- aggregate(aTOk[, 4], list(three.cluster), summary)
    three.cluster.summary

Sources:

score 9 · Accepted Answer

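As your link shows, the elbow criterion is for k-means. Cluster means are likewise tied to k-means and are not appropriate for linkage clustering (in particular not for single-linkage; see the single-link effect).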

Your question title, however, mentions hierarchical clustering, and so does your code?

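Note that the elbow criterion does not choose the optimal number of clusters. It chooses the optimal number of k-means clusters. If you use a different clustering method, it may need a different number of clusters.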

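There is no such thing as the objectively best clustering. Thus, there is also no objectively best number of clusters. There is a rule of thumb for k-means that chooses a (maybe best) tradeoff between the number of clusters and minimizing the target function (because increasing the number of clusters can always improve the target function); but that is mostly to counter a deficit of k-means. It is by no means objective.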
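To make that tradeoff concrete for the hierarchical case in the question, here is a minimal sketch (not something taken from the links above). It cuts a Ward dendrogram at k = 1..8, computes the within-cluster sum of squares W(k) for an elbow plot, and computes the Calinski-Harabasz index CH(k) = [B(k) / (k - 1)] / [W(k) / (n - k)] with B(k) = total SS - W(k). The built-in `USArrests` data, the helper `within_ss` and the choice of `max_k` are assumptions made only for illustration.

```r
# Sketch: elbow curve (WSS) and Calinski-Harabasz (CH) index for cuts of a
# Ward dendrogram. USArrests is a stand-in dataset, not the asker's data.
x  <- scale(USArrests)                      # standardize the variables
hc <- hclust(dist(x), method = "ward.D2")   # Ward linkage on Euclidean distances

# Hypothetical helper: total within-cluster sum of squared deviations
# from the cluster centroids for a given grouping of the rows of x
within_ss <- function(x, groups) {
  sum(sapply(split(as.data.frame(x), groups), function(part) {
    sum(scale(part, center = TRUE, scale = FALSE)^2)
  }))
}

n        <- nrow(x)
max_k    <- 8                               # largest number of clusters to try
total_ss <- within_ss(x, rep(1, n))         # W(1): everything in one cluster
wss      <- sapply(1:max_k, function(k) within_ss(x, cutree(hc, k = k)))

# CH(k) is undefined for k = 1, so it is computed for k = 2..max_k only
ks <- 2:max_k
ch <- ((total_ss - wss[ks]) / (ks - 1)) / (wss[ks] / (n - ks))

# Elbow plot: W(k) can only go down as k grows, so look for the bend
plot(1:max_k, wss, type = "b",
     xlab = "Number of clusters k", ylab = "Within-cluster sum of squares")

# The CH rule suggests the k with the largest index value
best_k_ch <- ks[which.max(ch)]
print(data.frame(k = ks, wss = wss[ks], ch = ch))
print(best_k_ch)
```

Because W(k) never increases when a cluster is split, the curve alone cannot name a "best" k; both the bend and the CH maximum are heuristics, which is the point made above.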

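Cluster analysis in itself is not an objective task. A clustering may be mathematically good, but useless. A clustering may score much worse mathematically, but it may give you insight into your data that cannot be measured mathematically.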

score 7 · Accepted Answer

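This is a very late answer and probably not useful anymore for the asker, but maybe for others. Check out the package NbClust. It contains 26 indices that give you a recommended number of clusters (and you can also choose your type of clustering). You can run it in such a way that you get the results for all the indices, and then you can basically go with the number of clusters recommended by most indices. And yes, I think basic statistics are the best way to describe clusters.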
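A rough sketch of what such a call might look like, assuming the CRAN package NbClust is installed. The `USArrests` data, the range 2 to 8 and the Ward method are illustrative choices, not something from the question. It asks only for the Calinski-Harabasz index (`index = "ch"`); passing `index = "all"` instead should request the full set of indices described above.

```r
# Sketch, assuming the CRAN package NbClust is available.
# USArrests is a stand-in dataset, not the asker's data.
if (requireNamespace("NbClust", quietly = TRUE)) {
  x <- scale(USArrests)                      # standardize the variables
  res <- NbClust::NbClust(data = x, distance = "euclidean",
                          min.nc = 2, max.nc = 8,
                          method = "ward.D2", index = "ch")
  print(res$All.index)           # index value for each candidate number of clusters
  print(res$Best.nc)             # number of clusters the index recommends
  print(table(res$Best.partition))  # cluster sizes of the recommended partition
} else {
  message("Package 'NbClust' is not installed; skipping the example.")
}
```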

score 1 · Accepted Answer

You could also try the R-NN Curves method: http://rguha.net/writing/pres/rnn.pdf

answered 2013-04-11T21:57:11.853

score 0 · Accepted Answer

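K-means clustering is highly sensitive to the scale of the data. Take a person's age and salary: if the data are not standardized, k-means will treat salary as a much more important clustering variable than age, which is not what you want. So it is always good practice to standardize the data first, bringing all variables to the same scale, and only then apply the cluster analysis.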
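A small sketch of that point, using made-up numbers (the `people` data frame is hypothetical). The same applies to the Euclidean distances fed into `hclust`: on the raw data they are almost exactly the salary differences, and `scale()` puts both columns on a comparable footing.

```r
# Hypothetical toy data (not from the question): two variables on very different scales
people <- data.frame(
  age    = c(20, 22, 24, 60, 62, 64),
  salary = c(30000, 50000, 40000, 35000, 45000, 52000)
)

# Ratio of the full Euclidean distance to the distance on salary alone;
# values very close to 1 mean that age contributes almost nothing
ratio <- as.vector(dist(people)) / as.vector(dist(people["salary"]))
print(range(ratio))

# Standardize each column to mean 0 and standard deviation 1
people_std <- scale(people)
print(round(colMeans(people_std), 10))
print(apply(people_std, 2, sd))

# Cluster the standardized data (Ward linkage, cut into two groups)
hc     <- hclust(dist(people_std), method = "ward.D2")
groups <- cutree(hc, k = 2)
print(table(groups))
```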
