0

我正在ELKI's AnderbergHierarchicalClustering为我的数据集使用过度150000观察,并且对于每个观察,我使用三个变量:latlng并且price它们都是double.

我有以下问题:

  • 我的数据集大于接受的数据集(<= 65535 个观察值)
  • 这个算法也是right shiftAgnes triangle——(size * (size - 1)) >>> 1这涉及到大RAM需求

为了解决这个问题,我决定将数据集拆分为20000 obs.

因为20000 obs我需要~4.8GB RAM

我不知道以这种方式拆分数据的最佳方法是什么,即应用于子集的聚类结果将尽可能接近聚类整个集合的结果。

4

1 回答 1

1

If you use single-linkage, you can use SLINK which only needs O(n) memory.

It will still need O(n^2) time.

Hierarchical clustering is not very scalable.

CLINK can do complete linkage with O(n) memory, but the result quality is usually not very good (usually worse than Anderberg with complete linkage).

All the other algorithms we have are O(n^2) in memory, unfortunately.

Thus at around 65535 instances you will hit a wall with Java.

I have one algorithm on my todo list that should be able to run in O(n log n) if I'm not mistaken with index support. But I did not get around to study it in detail yet - it's not clear which linkage strategies it can support besides single-linkage.

If you have a lot of duplicates, make sure to merge them first!

于 2016-03-11T11:29:33.757 回答