1

我正在尝试使用 ELKI 库实现 DBSCAN 集群测试应用程序。我的数据集是 6 维的,由大约 100.000 个对象组成。

我曾尝试在我的代码中使用 R*-Tree ELKI 优化,但对代码进行基准测试似乎仍然使用 O(n^2)。

这是我在应用程序中使用的代码:

ListParameterization dbscanParams = new ListParameterization();
dbscanParams.addParameter(DBSCAN.Parameterizer.EPSILON_ID, eps);
dbscanParams.addParameter(DBSCAN.Parameterizer.MINPTS_ID, minPts);
dbscanParams.addParameter(DBSCAN.DISTANCE_FUNCTION_ID, EuclideanDistanceFunction.class);

DBSCAN<DoubleVector, DoubleDistance> dbscan = ClassGenericsUtil.parameterizeOrAbort(DBSCAN.class, dbscanParams);

ArrayAdapterDatabaseConnection arrayAdapterDatabaseConnection = new ArrayAdapterDatabaseConnection(featuresMatrix, featuresLabels);

ListParameterization dbparams = new ListParameterization();
dbparams.addParameter(AbstractDatabase.Parameterizer.INDEX_ID, RStarTreeFactory.class);
dbparams.addParameter(RStarTreeFactory.Parameterizer.BULK_SPLIT_ID, SortTileRecursiveBulkSplit.class);
dbparams.addParameter(AbstractDatabase.Parameterizer.DATABASE_CONNECTION_ID, arrayAdapterDatabaseConnection);
dbparams.addParameter(AbstractPageFileFactory.Parameterizer.PAGE_SIZE_ID, pageSize);

Database db = ClassGenericsUtil.parameterizeOrAbort(StaticArrayDatabase.class, dbparams);

db.initialize();

Clustering<Model> result = dbscan.run(db);

运行上面的代码会导致以下结果:

| NUM_OBJECTS |  TIME(ms)  |
|-------------|------------|
| 4444        |  1508      |
| 8887        |  5547      |
| 17768       |  23401     |
| 35536       |  103733    |
| 71040       |  426494    |
| 142080      |  1801652   |

使用简单的 System.currentTimeMillis() 围绕 dbscan.run(db) 对时间进行基准测试。查看时间列,您可以看到趋势类似于 n^2 而不是 nlog(n),但我无法理解将 ELKI DBSCAN 与 R*-Tree 优化一起使用时缺少什么。

感谢您的任何帮助或建议。

4

1 回答 1

1

If you choose the query radius epsilon too large, every object will have O(n) neighbors.

Then the runtime will be O(n^2) or worse, even with index support; because the answer size of each query is O(n).

If you choose epsilon such that on average 10% of the objects will be within the epsilon radius, then your runtime will be at least O(n * 10% * n), which is O(n^2).

Which shows nicely how a theoretical runtime of O(n log n) may not give you a runtime of O(n log n) in practice. A R*-tree can answer radius or kNN queries in O(log n) on average - for small answer sets, where the answer set size can be neglected. A more precise analysis would likely yield a runtime of O(log n + |answer| log |answer|) (because we sort the answers by distance currently; we could remove this for some algorithms).

More often than not, an algorithm assumed to be O(n*n) will cost you O(n*n log n) runtime, because for every object you sort all others by distance. Fortunately, sorting is well optimized, so the extra log n does not matter much.

于 2014-07-18T17:02:07.640 回答