cluster-analysis - Python Clustering Algorithms

Question

I've been looking around scipy and sklearn for clustering algorithms for a particular problem I have. I need some way of characterizing a population of N particles into k groups, where k is not necessarily known, and in addition to this, no a priori linking lengths are known (similar to this question).

I've tried kmeans, which works well if you know how many clusters you want. I've tried dbscan, which does poorly unless you tell it a characteristic length scale on which to stop looking (or start looking) for clusters. The problem is, I have potentially thousands of these clusters of particles, and I cannot spend the time to tell kmeans/dbscan algorithms what they should go off of.

Here is an example of what dbscan finds (image: dbscanfail).

You can see that there really are two separate populations here, but no matter how I adjust the epsilon factor (the maximum distance between neighboring points), I simply cannot get it to see those two populations of particles.

Are there any other algorithms which would work here? I'm looking for minimal information upfront. In other words, I'd like the algorithm to be able to make "smart" decisions about what could constitute a separate cluster.

score 8 · Accepted Answer

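I found an algorithm that requires no a priori information or guesses and does very well at what I'm asking it to do. It's called Mean Shift and it is available in scikit-learn. It's also relatively fast (compared with other algorithms such as Affinity Propagation).

Here's an example of what it gives:

(image: mean shift result)

I'd also like to point out that the documentation states that it may not scale well.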
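
As a rough illustration of the Mean Shift approach, here is a minimal sketch using scikit-learn's `MeanShift` together with `estimate_bandwidth`. The synthetic two-blob data, the variable names, and the `quantile` value are assumptions made for the example; they are not taken from the original problem.

```python
import numpy as np
from sklearn.cluster import MeanShift, estimate_bandwidth

# Synthetic stand-in for the particle positions (assumption): two separated 2-D blobs.
rng = np.random.default_rng(0)
particles = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(200, 2)),
])

# Derive the kernel bandwidth from the data rather than hand-picking it.
# quantile=0.3 is the library default, passed explicitly here for visibility.
bandwidth = estimate_bandwidth(particles, quantile=0.3, random_state=0)

# bin_seeding=True starts the search from a coarse grid of the data,
# which reduces the number of seeds and speeds up the fit.
mean_shift = MeanShift(bandwidth=bandwidth, bin_seeding=True)
labels = mean_shift.fit_predict(particles)

# The number of clusters is an output of the algorithm, not an input.
n_clusters = len(np.unique(labels))
print("estimated bandwidth:", round(float(bandwidth), 3))
print("clusters found:", n_clusters)
```

The bandwidth is still a parameter, but estimating it from the data means nothing has to be tuned by hand for each particle population. Real data may need a different `quantile`.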

score 3 · Accepted Answer

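When using DBSCAN, it can be helpful to scale/normalize the data or distances beforehand, so that the estimate of epsilon is relative.
There is an implementation of DBSCAN (I think it is the one Anony-Mousse described somewhere as "floating around") that comes with an epsilon estimator function. It works as long as it is not fed large datasets.
There are several incomplete versions of OPTICS on GitHub. Maybe you can find one to adapt for your purpose. I am still trying to figure out what effect minPts has when using one and the same extraction method.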
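
A minimal sketch of the scaling idea, assuming scikit-learn is available. The helper `estimate_eps` is hypothetical and is not the estimator function mentioned above: it simply takes a percentile of the k-nearest-neighbour distances, a common heuristic. The synthetic data and the `percentile` value are also assumptions.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler


def estimate_eps(points, min_samples, percentile=95.0):
    """Hypothetical helper: pick eps as a percentile of the k-distance values.

    When the training points themselves are queried, each point is returned
    as its own nearest neighbour, which matches DBSCAN's convention that
    min_samples counts the point itself.
    """
    neighbours = NearestNeighbors(n_neighbors=min_samples).fit(points)
    distances, _ = neighbours.kneighbors(points)
    return float(np.percentile(distances[:, -1], percentile))


# Synthetic stand-in (assumption): two blobs whose axes have very different units.
rng = np.random.default_rng(0)
particles = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=(500.0, 0.5), size=(200, 2)),
    rng.normal(loc=(5000.0, 5.0), scale=(500.0, 0.5), size=(200, 2)),
])

# Standardize each axis so that one eps value is meaningful in every direction.
scaled = StandardScaler().fit_transform(particles)

min_samples = 5  # assumed value
eps = estimate_eps(scaled, min_samples)
labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(scaled)

# DBSCAN marks noise points with the label -1; do not count it as a cluster.
n_clusters = len(set(labels.tolist()) - {-1})
print("estimated eps:", round(eps, 3))
print("clusters found:", n_clusters)
```

Because eps is computed on standardized data, the same percentile rule can be reused across many particle populations without adjusting it for each one's physical units.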

score 1 · Accepted Answer

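You can try a minimum spanning tree (Zahn's algorithm) and then remove the longest edges, similar to alpha shapes. I used it with a Delaunay triangulation and a concave hull: http://www.phpdevpad.de/geofence. You can also try hierarchical clustering, for example clusterfck.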
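
The minimum spanning tree idea can be sketched with SciPy as follows. The rule for deciding which edges are "too long" (mean plus three standard deviations of the tree's edge lengths) is an assumption chosen for illustration, as is the synthetic data; this version works on the full distance matrix rather than a Delaunay triangulation.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import connected_components, minimum_spanning_tree
from scipy.spatial.distance import pdist, squareform

# Synthetic stand-in for the particle positions (assumption): two separated blobs.
rng = np.random.default_rng(0)
particles = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.5, size=(200, 2)),
    rng.normal(loc=(5.0, 5.0), scale=0.5, size=(200, 2)),
])

# Dense pairwise distances: O(N^2) memory, acceptable for a small sketch.
# Note: zero entries are treated as "no edge", so exact duplicate points
# would need special handling.
distances = squareform(pdist(particles))
mst = minimum_spanning_tree(distances).tocoo()

# Assumed cut rule: drop tree edges much longer than the typical edge.
threshold = mst.data.mean() + 3.0 * mst.data.std()
keep = mst.data <= threshold

# Rebuild the tree without the long edges; what remains is a forest.
pruned = coo_matrix(
    (mst.data[keep], (mst.row[keep], mst.col[keep])), shape=mst.shape
).tocsr()

# Each connected component of the forest is one cluster.
n_clusters, labels = connected_components(pruned, directed=False)
print("cut threshold:", round(float(threshold), 3))
print("clusters found:", n_clusters)
```

Isolated outliers can end up as their own single-point components under this rule, so a minimum cluster size filter may be needed on real data.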

score 1 · Accepted Answer

Your plot indicates that you chose the minPts parameter way too small.

Have a look at OPTICS, which no longer needs the epsilon parameter of DBSCAN.
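
For reference, scikit-learn ships an `OPTICS` estimator. The following is a minimal sketch on synthetic data; the `min_samples` value and the two blobs of different density are assumptions for the example, and the default xi-based cluster extraction is used.

```python
import numpy as np
from sklearn.cluster import OPTICS

# Synthetic stand-in (assumption): two blobs with different densities.
rng = np.random.default_rng(0)
particles = np.vstack([
    rng.normal(loc=(0.0, 0.0), scale=0.3, size=(200, 2)),
    rng.normal(loc=(6.0, 6.0), scale=0.8, size=(200, 2)),
])

# No eps is required; min_samples plays the role of DBSCAN's minPts.
# min_samples=20 is an assumed value, deliberately not too small.
optics = OPTICS(min_samples=20)
labels = optics.fit_predict(particles)

# Noise points are labelled -1; do not count them as a cluster.
n_clusters = len(set(labels.tolist()) - {-1})
print("clusters found:", n_clusters)

# The reachability values in cluster order can be plotted to inspect
# the density structure before trusting the extracted labels.
reachability = optics.reachability_[optics.ordering_]
```

The reachability plot is often more informative than the labels alone, since valleys in it correspond to dense groups at different scales.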
