cluster-analysis - Threshold for merging clusters

Question

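I'm using mean shift, which computes the position that each point in the data set converges to. I can also compute the Euclidean distance between the coordinates where two different points converged, but I have to supply a threshold: if (distance < threshold), the points belong to the same cluster and I can merge them.

How can I find the right value to use as the threshold?
(I can use any value, and the result depends on it, but I need the optimal one.)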

score 0 · Accepted Answer

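I've implemented mean-shift clustering several times and have run into this same issue. Depending on how many iterations you're willing to shift each point for, or what your termination criterion is, there is usually a post-processing step in which you have to group the shifted points into clusters. Points that in theory shift to the same mode do not actually have to end up exactly on top of each other.

I think the best and most general approach is to use a threshold based on the kernel bandwidth, as suggested in the comments. In the past, my code for this post-processing step has usually looked something like this: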

threshold = 0.5 * kernel_bandwidth
clusters = []
for p in shifted_points:
    cluster = findExistingClusterWithinThresholdOfPoint(p, clusters, threshold)
    if cluster is None:
        # create a new cluster with p as its first point
        new_cluster = [p]
        clusters.append(new_cluster)
    else:
        # add p to the existing cluster
        cluster.append(p)

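For the findExistingClusterWithinThresholdOfPoint function, I typically use the minimum distance from p to each currently defined cluster.

This seems to work pretty well. Hope this helps.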
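One possible way to fill in that helper is sketched below in Python. This is only an illustration under stated assumptions, not the answerer's actual code: the distance from p to a cluster is taken to be the distance to its nearest member, the closest qualifying cluster wins, and the wrapper name group_shifted_points and the sample points are hypothetical.

```python
import math

def findExistingClusterWithinThresholdOfPoint(p, clusters, threshold):
    """Return the closest cluster lying within threshold of p, or None."""
    best_cluster, best_dist = None, threshold
    for cluster in clusters:
        # Assumption: distance to a cluster = distance to its nearest member.
        d = min(math.dist(p, q) for q in cluster)
        if d < best_dist:
            best_cluster, best_dist = cluster, d
    return best_cluster

def group_shifted_points(shifted_points, kernel_bandwidth):
    """Group already-shifted points using a bandwidth-based threshold."""
    threshold = 0.5 * kernel_bandwidth
    clusters = []
    for p in shifted_points:
        cluster = findExistingClusterWithinThresholdOfPoint(p, clusters, threshold)
        if cluster is None:
            clusters.append([p])   # p starts a new cluster
        else:
            cluster.append(p)      # p joins the nearest existing cluster
    return clusters

# Hypothetical shifted points: two modes that did not converge exactly.
shifted = [(0.00, 0.00), (0.05, 0.02), (5.00, 5.00), (5.03, 4.98), (0.01, 0.04)]
groups = group_shifted_points(shifted, kernel_bandwidth=1.0)
```

Because points are assigned greedily in input order, the result can depend on that order when the threshold is close to the spacing between modes; a threshold well below the mode separation avoids this.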
