algorithm - Clustering [assessment] algorithms with a distance matrix as input

Question

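Can anyone suggest some clustering algorithms that can take a distance matrix as input? Or algorithms that can assess the "goodness" of a clustering, also based on the distance matrix?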

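At the moment I'm using a modification of Kruskal's algorithm ( http://en.wikipedia.org/wiki/Kruskal%27s_algorithm ) to split the data into two clusters. It has a problem, though: when the data has no distinct clusters, the algorithm still creates two clusters, one containing a single element and the other containing all the rest. In that case I would rather have one cluster containing all the elements and another that is empty.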
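For readers who want to reproduce the behaviour described above, here is a minimal sketch of one possible Kruskal-style split. The actual modification used in the question is not shown, so this is an assumption: edges are merged in order of increasing distance until two components remain. The function name `kruskal_two_clusters` and the toy data are made up for illustration.

```python
from itertools import combinations

def kruskal_two_clusters(dist):
    """Run Kruskal's algorithm on the complete graph defined by the
    distance matrix and stop as soon as two components remain."""
    n = len(dist)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the component root (with path halving).
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # All pairs of points, shortest distance first.
    edges = sorted(combinations(range(n), 2), key=lambda e: dist[e[0]][e[1]])
    components = n
    for i, j in edges:
        if components <= 2:
            break
        ri, rj = find(i), find(j)
        if ri != rj:
            parent[ri] = rj          # merge the two components
            components -= 1

    roots = sorted({find(i) for i in range(n)})
    return [roots.index(find(i)) for i in range(n)]

# Points on a line with no real cluster structure; the last one is only
# slightly further away than the others (hypothetical data).
xs = [0.0, 1.0, 2.0, 3.0, 4.5]
dist = [[abs(a - b) for b in xs] for a in xs]
labels = kruskal_two_clusters(dist)
print(labels)
```

On data like this the split isolates the single slightly more distant point, which is the singleton-cluster behaviour the question complains about.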

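Are there any algorithms capable of this type of clustering?

Are there any algorithms that can estimate how well the clustering was done or, even better, how many clusters there are in the data?

The algorithms should work only with a distance (similarity) matrix as input.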

score 2 · Accepted Answer

Or the algorithm which can assess the "goodness" of the clustering also based on the distance matrix?

KNN should be useful in assessing the “goodness” of a clustering assignment. Here's how:

Given a distance matrix with each point labeled according to the cluster it belongs to (its “cluster label”):

1. Test the cluster label of each point against the cluster label implied by k-nearest neighbors classification.
2. If the k nearest neighbors imply a different cluster, that point lowers the overall “goodness” rating of the clustering.
3. Sum up the “goodness rating” contributions from each of your points to get a total “goodness rating” for the whole clustering.
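The steps above can be sketched roughly as follows. This is only one reading of the idea, with some assumptions: each point contributes 1 if the majority label of its k nearest neighbours matches its own label and 0 otherwise, the total is normalised to a fraction, and ties in the vote go to the label seen first among the nearest neighbours. The name `knn_goodness` and the toy data are hypothetical.

```python
from collections import Counter

def knn_goodness(dist, labels, k):
    """Return (score, disagreeing), where score is the fraction of points
    whose label matches the majority label of their k nearest neighbours
    and disagreeing lists the indices of the points that do not match."""
    n = len(dist)
    agree = 0
    disagreeing = []
    for i in range(n):
        # Indices of all other points, nearest first.
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: dist[i][j])
        votes = Counter(labels[j] for j in others[:k])
        predicted = votes.most_common(1)[0][0]
        if predicted == labels[i]:
            agree += 1
        else:
            disagreeing.append(i)
    return agree / n, disagreeing

# Two groups on a line; the last point is deliberately given the wrong label.
xs = [0, 1, 2, 3, 10, 11, 12, 13]
dist = [[abs(a - b) for b in xs] for a in xs]
labels = [0, 0, 0, 0, 1, 1, 1, 0]
score, bad = knn_goodness(dist, labels, k=3)
print(score, bad)
```

The list of disagreeing indices is the part that can be fed back into the clustering, for example by relabelling those points and scoring again.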

Unlike k-means cluster analysis, this approach returns information about poorly categorized points. You can use that information to reassign those points to a different cluster, thereby improving the overall "goodness" of your clustering.

Since the algorithm knows nothing about the placement of the cluster centroids and hence nothing about the global cluster density, the only way to ensure clusters that are both locally and globally dense is to run the algorithm for a range of k values and find an arrangement that maximizes the goodness over that range.

For a large number of points, you'll probably need to optimize this algorithm, possibly with a hash table that keeps track of the nearest points relative to each point. Otherwise it will take quite a while to compute.
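One way to read the optimization hint, sketched under assumptions: sort each point's neighbours once, keep the result in a lookup table (a plain list indexed by point here, where a hash table would work the same way), and reuse it for every k in the range. The helper names `neighbour_order`, `agreement` and `scan_k` are invented for this example, and the score is the same majority-vote fraction assumed earlier.

```python
from collections import Counter

def neighbour_order(dist):
    """For each point, the indices of all other points sorted nearest
    first. Computed once so every value of k can reuse it."""
    n = len(dist)
    return [sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
            for i in range(n)]

def agreement(order, labels, k):
    """Fraction of points whose label equals the majority label among
    their k nearest neighbours."""
    hits = 0
    for i, neighbours in enumerate(order):
        votes = Counter(labels[j] for j in neighbours[:k])
        if votes.most_common(1)[0][0] == labels[i]:
            hits += 1
    return hits / len(labels)

def scan_k(dist, labels, k_values):
    """Score one labelling for several k values with a single sort per point."""
    order = neighbour_order(dist)
    return {k: agreement(order, labels, k) for k in k_values}

# Same toy data as a hypothetical example: two groups, one mislabelled point.
xs = [0, 1, 2, 3, 10, 11, 12, 13]
dist = [[abs(a - b) for b in xs] for a in xs]
labels = [0, 0, 0, 0, 1, 1, 1, 0]
scores = scan_k(dist, labels, k_values=(1, 3, 5))
print(scores)
```

Sorting costs O(N log N) per point and is paid once; each additional k then only needs a pass over the first k entries of every neighbour list.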

score 1 · Accepted Answer

Some methods that can be used to estimate the number of clusters are:

score 0 · Accepted Answer

scipy.cluster.hierarchy runs in 3 steps, just like Matlab(TM) clusterdata:

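import scipy.spatial.distance
import scipy.cluster.hierarchy as hier

Y = scipy.spatial.distance.pdist( pts )  # condensed pairwise distances; you have this already
Z = hier.linkage( Y, method )  # N-1 rows, one per merge
T = hier.fcluster( Z, ncluster, criterion=criterion )  # flat cluster label per point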
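A runnable version of the three steps might look like the sketch below. The points, the `"average"` linkage method and the `"maxclust"` criterion are choices made here for illustration, not something the answer specifies. Because the question starts from a square distance matrix rather than points, the sketch also shows `squareform`, which converts between the square form and the condensed form that `linkage` expects.

```python
import numpy as np
from scipy.spatial.distance import pdist, squareform
import scipy.cluster.hierarchy as hier

# Toy data: two well-separated groups of 2-d points (made up for illustration).
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
                [5.0, 5.0], [5.1, 5.0], [5.0, 5.1]])

Y = pdist(pts)        # condensed distances, length N*(N-1)/2
D = squareform(Y)     # the same distances as a symmetric N x N matrix

# Starting from a square matrix D, convert back to condensed form first.
Z = hier.linkage(squareform(D), method="average")   # (N-1) x 4 merge table
T = hier.fcluster(Z, t=2, criterion="maxclust")     # labels for at most 2 clusters
print(T)
```

Passing a square matrix straight to `linkage` is a common mistake: a 2-d array is interpreted as a set of observation vectors, not as distances, so the explicit `squareform` call matters.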

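The linkage might be a modified Kruskal, I don't know. This SO answer (ahem) uses the above.
As a measure of clustering, radius = rms distance to the cluster centre is fast and reasonable for 2d/3d points.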
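The radius measure can be sketched as below. The first function works on the points of one cluster. The second is an addition for the distance-matrix-only setting of the question: for Euclidean distances, the mean squared distance to the centroid equals the sum of all squared pairwise distances divided by 2n², so the same radius can be computed without coordinates. That identity does not hold for arbitrary dissimilarities. Both function names are made up for this example.

```python
from math import sqrt

def rms_radius(points):
    """RMS distance from the points of one cluster to their centroid."""
    n = len(points)
    dim = len(points[0])
    centre = [sum(p[d] for p in points) / n for d in range(dim)]
    mean_sq = sum(sum((p[d] - centre[d]) ** 2 for d in range(dim))
                  for p in points) / n
    return sqrt(mean_sq)

def rms_radius_from_distances(dist):
    """Same radius from a square matrix of Euclidean distances within one
    cluster, using mean squared radius = sum_ij d_ij^2 / (2 n^2)."""
    n = len(dist)
    total_sq = sum(d * d for row in dist for d in row)
    return sqrt(total_sq / (2 * n * n))

# Corners of a 2 x 2 square: the centroid is (1, 1).
square = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0)]
dist = [[sqrt((a[0] - b[0]) ** 2 + (a[1] - b[1]) ** 2) for b in square]
        for a in square]
r_points = rms_radius(square)
r_dist = rms_radius_from_distances(dist)
print(r_points, r_dist)
```

A small radius relative to the distance between cluster centres suggests tight, well-separated clusters; comparing radii across cluster counts is one rough way to use it.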

Could you tell us your Npt, ndim, ncluster, hier/flat? Clustering is a largish area; one size does not fit all.
