r - Understanding kmeans clustering in R

Question

The code below (minus my questions) generates a plot of the data colored by cluster, with the cluster centers marked.

I have marked the 4 areas that confuse me with "->".

> m <- matrix(c(1,1,1) , ncol=3)
> 
> x <- rbind(matrix(c(1,0,1) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,0) , ncol=3),
+            matrix(c(0,1,1) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,0,0) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,1) , ncol=3),
+            matrix(c(1,1,0) , ncol=3),
+            matrix(c(1,0,0) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,0,0) , ncol=3),
+            matrix(c(0,0,1) , ncol=3),
+            matrix(c(0,1,1) , ncol=3),
+            matrix(c(1,0,1) , ncol=3),
+            matrix(c(0,1,0) , ncol=3))
> colnames(x) <- c("google", "stackoverflow", "tester")
> (cl <- kmeans(x, 3))

K-means clustering with 3 clusters of sizes 3, 10, 3
-> Where do the sizes 3, 10, 3 appear?

Cluster means:
     google stackoverflow tester
1 0.6666667           1.0      0
2 0.5000000           0.5      1
3 0.3333333           0.0      0

-> There are three clusters, but what does each number signify?

Clustering vector:
 [1] 2 2 1 2 2 3 2 2 1 3 2 3 2 2 2 1

-> This looks to be created by summing the values of each matrix, but it seems to be unordered: the second element of this vector is '2', but the second element of 'x' is matrix(c(1,1,1) , ncol=3), which sums to '3'.

Within cluster sum of squares by cluster:
[1] 0.6666667 5.0000000 0.6666667
 (between_SS / total_SS =  46.1 %)

-> What are between_SS & total_SS?

Available components:

[1] "cluster"      "centers"      "totss"        "withinss"     "tot.withinss"
[6] "betweenss"    "size"        
> plot(x, col = cl$cluster)
> points(cl$centers, col = 1:5, pch = 8, cex = 2)
>

Can these questions be answered? From reading about the algorithm (http://en.wikipedia.org/wiki/K-means_clustering), I cannot see how R calculates these values.

score 3 · Accepted Answer

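1. What do the cluster sizes mean?

You provided 16 records and told kmeans to find 3 clusters. It grouped those 16 records into 3 clusters: A with 3 records, B with 10 records, and C with 3 records.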
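One way to see where the sizes come from is to count the labels in the clustering vector. The sketch below hard-codes the vector printed in the question (an assumption made so the result is reproducible; in a live session you would use cl$cluster, and kmeans may number the clusters differently from run to run because it picks random starting centers).

```r
# Clustering vector copied from the printed output in the question
# (in a live session this would be cl$cluster).
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Count how many records carry each cluster label.
sizes <- table(cluster)
print(sizes)
# cluster
#  1  2  3
#  3 10  3
```

In a fitted kmeans object the same counts are stored in cl$size.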

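2. What do the cluster means signify?

These numbers give the location, in N-dimensional space, of each cluster's centroid (its "mean"). You have three clusters, so you get three means. You have three dimensions ("google", "stackoverflow", "tester"), so you get one value per dimension. Reading the numbers across a row gives the location of a single centroid.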
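To make that concrete, the centroids can be recomputed by averaging each column within each cluster. This is a minimal sketch, not how kmeans computes them internally; it rebuilds x from the question and assumes the clustering vector printed there.

```r
# The 16 records from the question, one row per record.
x <- matrix(c(1,0,1,  1,1,1,  1,1,0,  0,1,1,
              0,0,1,  0,0,0,  1,1,1,  1,1,1,
              1,1,0,  1,0,0,  0,0,1,  0,0,0,
              0,0,1,  0,1,1,  1,0,1,  0,1,0),
            ncol = 3, byrow = TRUE,
            dimnames = list(NULL, c("google", "stackoverflow", "tester")))

# Clustering vector copied from the printed output (cl$cluster in a live session).
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Sum each column within each cluster, then divide each row by its cluster size.
centers <- rowsum(x, cluster) / as.vector(table(cluster))
print(centers)
```

Each row of centers should match the corresponding row under "Cluster means" in the printed output, for example (0.6666667, 1.0, 0) for cluster 1.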

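3. What is the clustering vector?

This is the cluster label the algorithm assigned to each record you passed in. Remember that I said earlier there are 3 clusters of sizes 3, 10, and 3? Those clusters are labeled 1, 2, and 3, and the algorithm stores each record's cluster label in this vector. Here you can see that there are 3 "1"s, 10 "2"s, and 3 "3"s. Does that make sense?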
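So the vector holds labels, not row sums. In this particular result, each label also happens to be the index of the centroid closest to that record, which the sketch below checks. It assumes the centroids printed under "Cluster means" and uses plain squared Euclidean distance; it is an illustration, not the procedure kmeans runs internally.

```r
# The 16 records from the question, one row per record.
x <- matrix(c(1,0,1,  1,1,1,  1,1,0,  0,1,1,
              0,0,1,  0,0,0,  1,1,1,  1,1,1,
              1,1,0,  1,0,0,  0,0,1,  0,0,0,
              0,0,1,  0,1,1,  1,0,1,  0,1,0),
            ncol = 3, byrow = TRUE)

# Centroids copied from "Cluster means" in the printed output (one per row).
centers <- matrix(c(2/3, 1.0, 0,
                    0.5, 0.5, 1,
                    1/3, 0.0, 0),
                  ncol = 3, byrow = TRUE)

# Squared distance from every record (rows) to every centroid (columns).
d2 <- sapply(1:3, function(k) rowSums(sweep(x, 2, centers[k, ])^2))

# Label each record with the index of its nearest centroid.
nearest <- apply(d2, 1, which.min)
print(nearest)
# [1] 2 2 1 2 2 3 2 2 1 3 2 3 2 2 2 1
```

The second record, (1, 1, 1), gets label 2 because it is closest to centroid 2 at (0.5, 0.5, 1), not because of anything to do with its sum.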

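4. What are between_SS & total_SS?

This is notation commonly used in ANOVA. total_SS is the sum of squared distances from every record to the overall mean of the data, and between_SS is the part of that total explained by the separation between the clusters, that is, total_SS minus the within-cluster sums of squares. You may find this helpful: http://www-ist.massey.ac.nz/dstirlin/CAST/CAST/HrandBlock/randBlock7.html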
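The 46.1 % figure can be reproduced by hand. The sketch below is one way to do it, assuming the clustering vector printed in the question; the helper name ss is made up for this example.

```r
# The 16 records from the question, one row per record.
x <- matrix(c(1,0,1,  1,1,1,  1,1,0,  0,1,1,
              0,0,1,  0,0,0,  1,1,1,  1,1,1,
              1,1,0,  1,0,0,  0,0,1,  0,0,0,
              0,0,1,  0,1,1,  1,0,1,  0,1,0),
            ncol = 3, byrow = TRUE)

# Clustering vector copied from the printed output (cl$cluster in a live session).
cluster <- c(2, 2, 1, 2, 2, 3, 2, 2, 1, 3, 2, 3, 2, 2, 2, 1)

# Hypothetical helper: sum of squared deviations of the rows of m
# from the column means of m.
ss <- function(m) sum(sweep(m, 2, colMeans(m))^2)

# Total SS: every record measured against the overall mean.
totss <- ss(x)

# Within SS: each cluster's records measured against that cluster's own mean.
withinss <- sapply(split(seq_len(nrow(x)), cluster),
                   function(i) ss(x[i, , drop = FALSE]))

# Between SS: what remains of the total once the within-cluster part is removed.
betweenss <- totss - sum(withinss)

pct <- round(100 * betweenss / totss, 1)
print(withinss)  # 0.6666667 5.0000000 0.6666667
print(pct)       # 46.1
```

These correspond to the totss, withinss, and betweenss components listed under "Available components". A higher between_SS / total_SS ratio means the clusters account for more of the spread in the data.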
