algorithm - 用于数据挖掘的 DBSCAN 算法和聚类算法

Question

您如何在分类数据（蘑菇数据集）上实现 DBSCAN 算法？

什么是一次性聚类算法？

您能否提供一次性聚类算法的伪代码？

score 1 · Accepted Answer

You can run DBSCAN with an arbitrary distance function without any changes to it. The indexing part will be more difficult, so you will likely only get O(n^2) complexity.

But if you look closely at DBSCAN, all it does is compute distances, compare them to a threshold, and count objects. This is a key strength of it, it can easily be applied to various kinds of data, all you need is to define a distance function and thresholds.

I doubt there is a one-pass version of DBSCAN, as it relies on pairwise distances. You can prune some of these computations (this is where the index comes into play), but essentially you need to compare every object to every other object, so it is in O(n log n) and not one-pass.

One-pass: I believe the original k-means was a one-pass algorithm. The first k objects are your initial means. For every new object, you choose the closes mean and update it (incrementally) with the new object. As long as you don't do another iteration over your data set, this was "one-pass". (The result will be even worse than lloyd-style k-means though).

score -2 · Accepted Answer

阅读前 k 个项目并持有它们。计算它们之间的距离。

对于每个剩余的项目：

找出它最接近 k 个项目中的哪一个，以及这两个项目之间的距离。
如果这比 k 项中任意两项之间的最近距离长，则可以将新项与这两项中的一项进行交换，并且至少不会减少任何两项新 k 项之间的最近距离。这样做是为了尽可能地增加这个距离。

假设所有项目的集合可以划分为 l <= k 个簇，使得同一簇中任意两点之间的距离小于不同簇中任意两点之间的距离。然后在运行此算法后，您将至少保留每个集群中的一个点。

algorithm - 用于数据挖掘的 DBSCAN 算法和聚类算法

2 回答 2

Related

Reference