r - kmeans return values in R

Question

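I am using the kmeans() function in R and I am curious what the difference is between the totss and tot.withinss attributes of the returned object. From the documentation they seem to return the same thing, but on my dataset totss is 66213.63 and tot.withinss is 6893.50. Please let me know if you are familiar with more of the details. Thanks!

Marius.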

score 20 · Accepted Answer

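Given the between-cluster sum of squares betweenss and the vector withinss of within-cluster sums of squares (one per cluster), the formulas are: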

totss = tot.withinss + betweenss
tot.withinss = sum(withinss)

For example, if there were only one cluster, then betweenss would be 0, withinss would have only one component, and totss = tot.withinss = withinss.
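As a quick check of the two identities and of the single-cluster case, here is a minimal sketch. The simulated data, the seed, and the choice of three clusters are assumptions made only for illustration; they are not taken from the question.

```r
# Simulated two-group data (hypothetical; any numeric matrix would do)
set.seed(1)
x <- rbind(matrix(rnorm(100, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Several clusters: both identities should hold up to floating-point error
fit3 <- kmeans(x, centers = 3, nstart = 5)
id_total  <- isTRUE(all.equal(fit3$totss, fit3$tot.withinss + fit3$betweenss))
id_within <- isTRUE(all.equal(fit3$tot.withinss, sum(fit3$withinss)))

# One cluster: nothing lies "between" clusters, so betweenss is numerically zero
fit1 <- kmeans(x, centers = 1)
one_component <- length(fit1$withinss) == 1
no_between    <- abs(fit1$betweenss) < 1e-8 * fit1$totss
same_total    <- isTRUE(all.equal(fit1$totss, fit1$tot.withinss))

print(c(id_total = id_total, id_within = id_within,
        one_component = one_component, no_between = no_between,
        same_total = same_total))
```

The comparisons use a tolerance rather than exact equality because betweenss is obtained by subtraction and may differ from zero by rounding error.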

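To clarify further, we can compute these quantities ourselves from the cluster assignments, which may help make their meanings clear. Consider the data x and the cluster assignments cl$cluster from the example in help(kmeans). Define the sum of squares function as follows: it subtracts from each column of x the mean of that column, then sums the squares of all elements of the resulting matrix: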

# or ss <- function(x) sum(apply(x, 2, function(x) x - mean(x))^2)
ss <- function(x) sum(scale(x, scale = FALSE)^2)

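Then we have the following. Note that cl$centers[cl$cluster, ] gives the fitted values, i.e. it is a matrix with one row per point, such that the i-th row is the center of the cluster that the i-th point belongs to.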

example(kmeans) # create x and cl

betweenss <- ss(cl$centers[cl$cluster,]) # or ss(fitted(cl))

withinss <- sapply(split(as.data.frame(x), cl$cluster), ss)
tot.withinss <- sum(withinss) # or  resid <- x - fitted(cl); ss(resid)

totss <- ss(x) # or tot.withinss + betweenss

cat("totss:", totss, "tot.withinss:", tot.withinss, 
  "betweenss:", betweenss, "\n")

# compare above to:

str(cl)

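EDIT:

Since this question was answered, R has added further similar kmeans examples (see example(kmeans)) and a fitted.kmeans method; the comments trailing the code lines above show how fitted() fits in.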
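For reference, a small sketch of how the fitted method relates to the quantities above. It regenerates its own data rather than relying on example(kmeans); the data, seed, and number of clusters are assumptions for illustration.

```r
# Hypothetical two-group data and a 2-cluster fit
set.seed(42)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))
cl <- kmeans(x, centers = 2, nstart = 5)

# Same helper as in the answer: sum of squares about the column means
ss <- function(x) sum(scale(x, scale = FALSE)^2)

fit_centers <- fitted(cl)                      # one row per point: its cluster center
fit_classes <- fitted(cl, method = "classes")  # cluster index of each point
resid <- x - fit_centers                       # deviation of each point from its center

same_fitted  <- all(fit_centers == cl$centers[cl$cluster, ])
same_classes <- all(fit_classes == cl$cluster)
between_ok   <- isTRUE(all.equal(ss(fit_centers), cl$betweenss))
within_ok    <- isTRUE(all.equal(sum(resid^2), cl$tot.withinss))

print(c(same_fitted = same_fitted, same_classes = same_classes,
        between_ok = between_ok, within_ok = within_ok))
```

Here the fitted values carry the between-cluster variation and the residuals carry the within-cluster variation, which is the decomposition totss = tot.withinss + betweenss.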

score 0 · Accepted Answer

I think you have found a bug in the documentation, which says:

withinss     The within-cluster sum of squares for each cluster.
totss        The total within-cluster sum of squares.
tot.withinss     Total within-cluster sum of squares, i.e., sum(withinss).

If you use the sample dataset from the help page example:

> kmeans(x,2)$tot.withinss
[1] 15.49669
> kmeans(x,2)$totss
[1] 65.92628
> kmeans(x,2)$withinss
[1] 7.450607 8.046079

I think someone should write to the r-devel mailing list asking for the help page to be revised. If you would rather not, I am willing to do it.
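The mismatch with the documented description can also be checked directly. The sketch below, on simulated data (an assumption for illustration, not the help page dataset), suggests that totss is the total sum of squares about the overall column means and does not depend on the number of clusters, whereas tot.withinss does.

```r
# Hypothetical two-group data
set.seed(7)
x <- rbind(matrix(rnorm(100, sd = 0.3), ncol = 2),
           matrix(rnorm(100, mean = 1, sd = 0.3), ncol = 2))

# Total sum of squares about the grand (column) means, computed by hand
total_ss <- sum(scale(x, scale = FALSE)^2)

# Fit k-means for several values of k and collect both quantities
ks   <- 2:4
fits <- lapply(ks, function(k) kmeans(x, centers = k, nstart = 10))
totss_by_k  <- sapply(fits, function(f) f$totss)
within_by_k <- sapply(fits, function(f) f$tot.withinss)

print(rbind(k = ks, totss = totss_by_k, tot.withinss = within_by_k))

# totss matches the hand-computed total for every k;
# tot.withinss is strictly smaller once the data are split into clusters
totss_constant <- all(abs(totss_by_k - total_ss) < 1e-8 * total_ss)
within_smaller <- all(within_by_k < total_ss)
```

This is consistent with the gap the asker saw between totss and tot.withinss: the first measures all variation in the data, the second only what remains inside the clusters.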
