
I have a dataset with 1000 dimensions and I am trying to cluster the data with DBSCAN in Python. I have a hard time understanding which metric to choose and why.

Can someone explain this? And how should I decide what values to set eps to?

I am interested in the finer structure of the data, so min_samples is set to 2. Currently I use the default metric for DBSCAN in sklearn, but for small eps values, such as eps < 0.07, I get a few clusters but miss many points, and for larger values I get several smaller clusters and one huge one. I do understand that everything depends on the data at hand, but I am interested in tips on how to choose eps values in a coherent and structured way, and which metrics to choose!

I have read this question, but the answers there deal with 10 dimensions and I have 1000 :). I also do not know how to evaluate my metric, so a more elaborate explanation than "evaluate your metric!" would be appreciated.

Edit: Or tips on other clustering algorithms that work on high dimensional data with an existing python implementation.


1 Answer


First of all, with minPts=2 you are not actually doing DBSCAN clustering; the result will degenerate into single-linkage clustering.

You really should use minPts=10 or higher.
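As a minimal sketch of that recommendation, here is how it might look with sklearn's DBSCAN; the data is a random placeholder and the eps value is arbitrary, chosen only for illustration (in practice it must come from your own distance distribution):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 1000))  # placeholder for your 1000-dimensional data

# min_samples corresponds to minPts; use 10 or higher rather than 2,
# otherwise the result degenerates toward single-linkage clustering.
labels = DBSCAN(eps=35.0, min_samples=10, metric="euclidean").fit_predict(X)
# noise points are labeled -1
```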

Unfortunately, you did not bother to tell us which distance metric you are actually using!

Epsilon really depends heavily on your dataset and your metric. Without knowing your parameters and your dataset, we cannot help you. Have you tried plotting a histogram of the distances to see which values are typical? That is probably the best heuristic for choosing this threshold: look at the quantiles of a histogram of the distances (or of a sample of them).
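The distance-quantile heuristic above can be sketched as follows, assuming Euclidean distance and placeholder data; the quantile levels shown are just starting points to inspect:

```python
import numpy as np
from sklearn.metrics import pairwise_distances

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 1000))  # placeholder for (a sample of) your data

# Compute pairwise distances on a sample and look at the low quantiles;
# a quantile in this range is a reasonable first guess for eps.
d = pairwise_distances(X, metric="euclidean")
upper = d[np.triu_indices_from(d, k=1)]  # each pair counted once
for q in (0.01, 0.05, 0.10):
    print(f"{q:.2f}-quantile of pairwise distance: {np.quantile(upper, q):.3f}")
```

Plotting `upper` with `matplotlib.pyplot.hist` gives the histogram the answer suggests; for large datasets, run this on a random subsample rather than the full pairwise matrix.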

However, note that OPTICS does get rid of this parameter (at least when you have a proper implementation). When extracting clusters with the Xi method, you only need epsilon to be large enough not to cut the structures you are interested in, and small enough to get the runtime you want (larger is slower, though not linearly). Xi then gives the relative increase in distance that is considered significant.
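sklearn now ships an OPTICS implementation with Xi-based cluster extraction, so this can be tried directly in Python; a minimal sketch on synthetic two-blob data (the blob layout and parameter values are placeholders, not recommendations for your dataset):

```python
import numpy as np
from sklearn.cluster import OPTICS

rng = np.random.default_rng(0)
# two well-separated synthetic blobs in 50 dimensions
X = np.vstack([rng.normal(0.0, 1.0, size=(100, 50)),
               rng.normal(8.0, 1.0, size=(100, 50))])

# max_eps only needs to be large enough not to cut interesting structure;
# xi is the relative drop in reachability considered a cluster boundary.
opt = OPTICS(min_samples=10, max_eps=np.inf, cluster_method="xi", xi=0.05).fit(X)
labels = opt.labels_  # -1 marks noise
```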

Answered on 2013-04-22T15:35:39.140