I have a dataset with 1000 dimensions and I am trying to cluster the data with DBSCAN in Python. I am having a hard time understanding which metric to choose and why.
Can someone explain this? And how should I decide what value to set eps to?
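For reference, here is roughly how I am calling it (a minimal sketch; as far as I understand, the distance function is chosen through the `metric` parameter, which defaults to Euclidean):

```python
from sklearn.cluster import DBSCAN

# Default metric is Euclidean; other built-in choices include
# e.g. 'manhattan' or 'cosine'.
db = DBSCAN(eps=0.07, min_samples=2, metric='euclidean')
```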
I am interested in the finer structure of the data, so min_samples is set to 2. Right now I use the default metric (Euclidean) for DBSCAN in sklearn, but for small eps values, such as eps < 0.07, I get a few clusters and miss many points, and for larger values I get several smaller clusters and one huge one. I do understand that everything depends on the data at hand, but I am interested in tips on how to choose eps values in a coherent and structured way, and on which metrics to choose!
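This is roughly the sweep I have tried (X stands in for my real data; the random array below is just a placeholder so the snippet runs):

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Placeholder for my real data: shape (n_samples, 1000)
X = np.random.rand(500, 1000)

for eps in [0.05, 0.07, 0.1, 0.5, 1.0]:
    # Default metric (Euclidean), min_samples=2 for fine structure
    labels = DBSCAN(eps=eps, min_samples=2).fit_predict(X)
    n_clusters = len(set(labels)) - (1 if -1 in labels else 0)  # -1 marks noise
    n_noise = int(np.sum(labels == -1))
    print(f"eps={eps}: {n_clusters} clusters, {n_noise} noise points")
```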
I have read this question, but the answers there deal with 10 dimensions and I have 1000 :). I also do not know how to evaluate my metric, so a more elaborate explanation than "evaluate your metric!" would be appreciated.
Edit: Or tips on other clustering algorithms that work well on high-dimensional data and have an existing Python implementation.