r - How can I match cluster labels from different methods to the true labels in R?

Question

Basically, I simulated 1000 data sets and then clustered them with different clustering techniques, for example k-means, model-based clustering, and so on.

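I can then assess each method's performance using the correct classification rate (CCR). However, I run into the label switching problem, so I cannot obtain the true CCR. My question is: is there a way to unify all the labels in R for multivariate data sets?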

Here is a simple example:

  # Create the random data sets:

  data1 <- rnorm(5, 0, 0.5) # cluster 1

  data2 <- rnorm(5, 2, 0.5) # cluster 2

  data3 <- rnorm(5, 4, 0.5) # cluster 3

  alldata <- c(data1, data2, data3)

  # cluster the data using different methods:

  library(cluster)  # provides pam()

  km.method <- kmeans(alldata, centers = 3)$cluster
  # [1] 3 3 3 3 3 1 1 1 1 1 2 2 2 2 2

  pam.method <- pam(alldata, 3)$clustering
  # [1] 1 1 1 1 1 2 2 2 2 2 3 3 3 3 3


  # As you can see, the partitions are exactly the same, but the labels differ!
  # How can I unify the labels across all methods so they match the true labels?

score 0 · Accepted Answer

CCR is not an appropriate measure for clustering.

Since a clustering algorithm does not produce classes, the CCR is 0 by definition.

Consider the Iris data set. The correct classes are the species. A clustering method such as k-means produces "labels" 0, 1, 2. None of these is correct.
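As a rough illustration of the Iris point, the sketch below cross-tabulates the species against k-means cluster numbers. It assumes base R's built-in `iris` data and `kmeans`; the seed value and the name `overlap` are arbitrary choices made for this sketch. Note that R's `kmeans` numbers clusters from 1 rather than 0, and the numbering is arbitrary either way.

```r
# k-means starts from random centers, so fix the seed for repeatability
set.seed(1)

# Cluster the four numeric measurements into three groups
km <- kmeans(iris[, 1:4], centers = 3)

# Rows are the true species, columns are the arbitrary cluster numbers
overlap <- table(species = iris$Species, cluster = km$cluster)
print(overlap)
```

Each row shows how one species is spread across the clusters. Which number a cluster receives carries no meaning, so comparing the numbers to the species names directly makes no sense.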

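The proper way to evaluate clusterings is with clustering evaluation measures such as the Adjusted Rand Index and Normalized Mutual Information. These evaluate how the sets overlap, not the individual labels.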
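A minimal sketch of the Adjusted Rand Index in base R, applied to the label vectors from the question. The helper name `adjusted_rand` is hypothetical, and `truth` is an assumed vector of true labels for the simulated data; if the mclust package is installed, its `adjustedRandIndex` function should give the same values.

```r
# Adjusted Rand Index computed from the contingency table of two labelings.
# `adjusted_rand` is a hypothetical helper name, not taken from any package.
adjusted_rand <- function(a, b) {
  tab <- table(a, b)                     # overlap counts between the two partitions
  pairs <- function(n) n * (n - 1) / 2   # number of pairs among n items
  sum_cells <- sum(pairs(tab))           # pairs grouped together in both partitions
  sum_rows  <- sum(pairs(rowSums(tab)))  # pairs grouped together in a
  sum_cols  <- sum(pairs(colSums(tab)))  # pairs grouped together in b
  expected  <- sum_rows * sum_cols / pairs(length(a))  # agreement expected by chance
  maximum   <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (maximum - expected)
}

# Assumed true labels, plus the outputs shown in the question
truth      <- rep(1:3, each = 5)
km.method  <- c(3, 3, 3, 3, 3, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2)
pam.method <- c(1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3)

adjusted_rand(truth, km.method)   # 1: same partition despite permuted labels
adjusted_rand(truth, pam.method)  # 1
```

Because the index depends only on which points are grouped together, it is unaffected by label switching. The sketch returns `NaN` when both partitions put every point in a single cluster, since the denominator is then zero.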
