If you use single-linkage
, you can use SLINK
which only needs O(n)
memory.
It will still need O(n^2)
time.
Hierarchical clustering
is not very scalable.
CLINK
can do complete linkage
with O(n)
memory, but the result quality is usually not very good (usually worse than Anderberg
with complete linkage
).
All the other algorithms we have are O(n^2)
in memory, unfortunately.
Thus at around 65535 instances you will hit a wall with Java
.
I have one algorithm on my todo list that should be able to run in O(n log n)
if I'm not mistaken with index support
. But I did not get around to study it in detail yet - it's not clear which linkage strategies
it can support besides single-linkage
.
If you have a lot of duplicates, make sure to merge them first!