
As I understand DBSCAN, you can specify an epsilon of, say, 100 metres and, because DBSCAN considers density-reachability rather than direct density-reachability when finding clusters, end up with a cluster in which the maximum distance between any two points is > 100 metres. In a more extreme case, it seems you could set epsilon to 100 metres and end up with a cluster 1 kilometre across: see [2][6] in this array of images from scikit-learn for an example of when that might occur. (I'm perfectly willing to be told I'm a complete idiot and have misunderstood DBSCAN, if that's what is going on here.)
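To make the chaining effect concrete, here is a minimal synthetic sketch (my own illustration, assuming planar coordinates in metres rather than GPS coordinates): points spaced 90 m apart end up in a single scikit-learn DBSCAN cluster roughly 900 m across, even though eps is 100 m.

```python
# Minimal sketch of the chaining effect (illustrative only; planar coordinates in metres).
import numpy as np
from sklearn.cluster import DBSCAN

# 11 points on a line, 90 m apart, spanning 900 m in total.
points = np.array([[90.0 * i, 0.0] for i in range(11)])

labels = DBSCAN(eps=100, min_samples=2).fit_predict(points)
print(labels)                                    # every point lands in cluster 0
print(points[:, 0].max() - points[:, 0].min())   # 900.0 -- far larger than eps
```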

Is there a density-based algorithm like DBSCAN that also takes into account some threshold for the maximum distance between any two points in a cluster?


1 Answer


DBSCAN indeed does not impose a total size constraint on the cluster.

The epsilon value is best interpreted as the size of the gap separating two clusters (a gap that may contain at most minpts-1 objects).

I believe you are in fact not even looking for clustering: clustering is the task of discovering structure in data. That structure can be simple (such as the clusters found by k-means) or complex (such as the arbitrarily shaped clusters discovered by hierarchical clustering and DBSCAN).

You might be looking for vector quantization - reducing a data set to a smaller set of representatives - or set cover - finding the optimal cover for a given set - instead.
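If vector quantization is what you actually need, a minimal sketch along those lines (illustrative only; the data and the number of representatives k are placeholders) could simply use k-means centroids as the reduced set of representatives:

```python
# Vector-quantization sketch (illustrative): reduce the data set to k representatives.
import numpy as np
from sklearn.cluster import MiniBatchKMeans

rng = np.random.default_rng(0)
data = rng.uniform(0, 1000, size=(10_000, 2))   # placeholder coordinates

k = 50                                          # number of representatives (placeholder)
quantizer = MiniBatchKMeans(n_clusters=k, random_state=0).fit(data)

representatives = quantizer.cluster_centers_    # the compressed data set
assignment = quantizer.predict(data)            # representative index for each point
```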

However, I also have the impression that you aren't really sure of what you need and why.

A strength of DBSCAN is that it has a mathematical definition of structure in the form of density-connected components. This is a strong and (except for some rare border cases) well-defined mathematical concept, and the DBSCAN algorithm is an optimally efficient algorithm to discover this structure.

Direct density reachability, however, doesn't define a useful (partitioning) structure: it simply does not partition the data into disjoint sets.

If you don't need this kind of strong structure (i.e. you aren't doing clustering as in "structure discovery", but just want to compress your data as in vector quantization), you could give "canopy preclustering" a try. It can be seen as a preprocessing step designed for clustering. Essentially, it is like DBSCAN, except that it uses two epsilon values, and the resulting structure is not guaranteed to be optimal in any way; it will depend heavily on the ordering of your data. If you then process it appropriately, it can still be useful. Unless you are in a distributed setting, however, canopy preclustering is at least as expensive as a full DBSCAN run. Due to its loose requirements (in particular, "clusters" may overlap, and objects are expected to belong to multiple "clusters"), it is easier to parallelize.
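A rough sketch of the canopy idea (my own illustration, not code from any particular library; t1 > t2 are the two hypothetical thresholds, and Euclidean distance is assumed):

```python
# Rough canopy-preclustering sketch (illustrative only).
import numpy as np

def canopy(points, t1, t2):
    """Greedy canopy assignment; the result depends on the point ordering."""
    remaining = list(range(len(points)))
    canopies = []
    while remaining:
        center = remaining[0]                     # next unprocessed point becomes a canopy center
        dists = np.linalg.norm(points[remaining] - points[center], axis=1)
        # loose threshold t1: these points join the canopy (canopies may overlap)
        members = [remaining[i] for i, d in enumerate(dists) if d < t1]
        canopies.append((center, members))
        # tight threshold t2: these points may no longer start a new canopy
        remaining = [remaining[i] for i, d in enumerate(dists) if d >= t2]
    return canopies

# Example with placeholder data and thresholds.
pts = np.random.default_rng(1).uniform(0, 1000, size=(200, 2))
print(len(canopy(pts, t1=150.0, t2=75.0)))
```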

Oh, and you might also just be looking for complete-linkage hierarchical clustering. If you cut the dendrogram at your desired height, the resulting clusters should all have the desired maximum distance between any two of their objects. The only problem is that hierarchical clustering is usually O(n^3), i.e. it doesn't scale to large data sets, whereas DBSCAN runs in O(n log n) in good implementations (with index support).
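A minimal sketch of that route with SciPy (illustrative only; max_dist stands in for whatever pairwise-distance threshold you need):

```python
# Complete-linkage sketch (illustrative): cutting the dendrogram at max_dist
# guarantees that no two points in the same cluster are further apart than max_dist.
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

points = np.random.default_rng(2).uniform(0, 1000, size=(500, 2))  # placeholder data
max_dist = 100.0                                                    # desired maximum pairwise distance

Z = linkage(points, method='complete')          # builds the full pairwise distance matrix internally
labels = fcluster(Z, t=max_dist, criterion='distance')

# Check the guarantee: within each cluster, no two points are further apart than max_dist.
for c in np.unique(labels):
    members = points[labels == c]
    if len(members) > 1:
        diam = np.linalg.norm(members[:, None] - members[None, :], axis=-1).max()
        assert diam <= max_dist + 1e-9
```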

Answered 2013-08-31T21:47:20.073