python - Sequential k-means

Question

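Can I use the cluster_centers_ coordinates from a previous KMeans fit as the init parameter, so that the cluster centers are updated sequentially as new data arrives? Are there any drawbacks to this approach?

Update: an online version of scikit-learn K-means: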

KM = KMeans(n_clusters=3, random_state=200, n_init=1)
ni = 0

Until interrupted:
    for x in data:
        KM_updated = KM.fit(x)
        Updated_centroids(i) = KM_updated.cluster_centers_(i) + 1/len(KM_updated.labels_(i) + 1) * (x - KM_updated.cluster_centers_(i))
        KM = KMeans(n_clusters=3, random_state=200, init=Updated_centroids(i), n_init=1)

score 1 · Accepted Answer

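Yes, this is a possible solution. However, you can improve your implementation further by following this pseudocode (for more information, see the post Online k-means clustering):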

Make initial guesses for the means m1, m2, ..., mk
Set the counts n1, n2, ..., nk to zero
Until interrupted
    Acquire the next example, x
    If mi is closest to x
        Increment ni
        Replace mi by mi + (1/ni)*( x - mi)
    end_if
end_until

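With this version of the online algorithm, you only need to remember the mean of each cluster and the number of data points assigned to it. Once these two variables have been updated, you can forget the new data point.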
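The pseudocode above could be sketched in Python roughly as follows. This is only an illustration, not a scikit-learn API: the class name OnlineKMeans, its update method, and the toy data points are all hypothetical, and Euclidean distance is assumed for "closest".

```python
import numpy as np


class OnlineKMeans:
    """Sequential k-means sketch: stores only the means and per-cluster counts."""

    def __init__(self, initial_means):
        # Initial guesses for the means m1..mk (copied so the caller's data is untouched).
        self.means = np.array(initial_means, dtype=float)
        # Counts n1..nk start at zero.
        self.counts = np.zeros(len(self.means), dtype=int)

    def update(self, x):
        """Assign x to its closest mean and move that mean toward x."""
        x = np.asarray(x, dtype=float)
        # Index i of the mean closest to x (Euclidean distance, an assumption).
        i = int(np.argmin(np.linalg.norm(self.means - x, axis=1)))
        self.counts[i] += 1
        # mi <- mi + (1/ni) * (x - mi)
        self.means[i] += (x - self.means[i]) / self.counts[i]
        return i


# Hypothetical toy stream of 2-D points.
model = OnlineKMeans([[0.0, 0.0], [10.0, 10.0]])
for point in [[1.0, 1.0], [9.0, 9.0], [3.0, 3.0]]:
    model.update(point)
```

Note that because the counts start at zero, the first point assigned to a cluster replaces that cluster's initial guess entirely; after that, each mean is the running average of the points assigned to it.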

Compared with your approach, this solution does not need to keep past data, so it is computationally more efficient.

This exact implementation is not available in scikit-learn. The closest implementation is probably the MiniBatchKMeans estimator with its partial_fit method.
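For comparison, here is a minimal sketch of feeding data incrementally to MiniBatchKMeans through partial_fit. The synthetic batches, the true_centers array, and the batch size are assumptions made up for illustration; the estimator settings mirror the ones in the question.

```python
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
# Hypothetical cluster locations used only to generate toy data.
true_centers = np.array([[0.0, 0.0], [5.0, 5.0], [0.0, 5.0]])

mbk = MiniBatchKMeans(n_clusters=3, random_state=200, n_init=1)

for _ in range(20):
    # Each batch: 10 noisy points around every true center (30 samples total).
    batch = np.vstack(
        [c + 0.3 * rng.standard_normal((10, 2)) for c in true_centers]
    )
    # Updates the existing centers in place; earlier batches are not needed again.
    mbk.partial_fit(batch)

labels = mbk.predict(true_centers)
```

The first call to partial_fit initializes the centers from that batch, so the first batch should contain at least n_clusters samples. Unlike the pseudocode, MiniBatchKMeans updates on mini-batches rather than on one point at a time, so the results are not expected to match the sequential algorithm exactly.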
