algorithm - 代表的递增层次结构

Question

我有一个遵循这个方案的增量聚类算法：

Let x a new data-point, and c the centroid that is closest from x
if( distance(x, c) > threshold )
   x becomes a new cluster center (i.e. a new centroid)
else assign x to c (i.e. update the centroid by taking x)

为了加快从 x 搜索最近的中心，我想要一个中心的层次结构（使用树），我们可以在每次考虑新数据点时增量更新。

树的每个内部节点都表示为该节点下质心的平均值。当更新一个给定的质心时（因为一个新的数据点被分配给这个质心），我们应该重建这个质心之上的所有节点。

因此，算法变成了这样：

Let x a new data-point
c = searchClosestCenter(x, tree) // return the centroid closest to x
if( distance(x, c) > threshold )
   x becomes a new cluster center (i.e. a new centroid)
   AddCenterToTree(x, tree)
else
   assign x to c (i.e. update the centroid by taking x)
   UpdateTree(c) // update all nodes that are  on top of c

在这种情况下如何定义这个函数？有没有更好的解决方案？

score 1 · Accepted Answer

使用R-tree怎么样？它使用最小边界矩形来总结叶子页面中的对象。您也可以使用 kd-tree，但它的性能会随着时间的推移而下降（除非您重建它），因为它可能会变得不平衡。

无论如何，R-tree 对于这类数据来说是一种非常流行的数据结构。它用于 Oracle、SQLite、Postgres、MySQL...

R*-trees 是 R-tree 的改进版本。他们有一个更好的拆分策略，对插入的细微更改，并重新插入作为拆分的替代方案，以改善树的平衡。搜索是相同的。

作为一种优化，您可以通过以下优化来增强 R-tree：您还可以添加“替换”操作，而不是删除旧条目并插入新条目。您首先检查将插入新均值的位置。如果和之前是同一个页面，就在页面中替换它，最终更新边界框。

algorithm - 代表的递增层次结构

1 回答 1

Related

Reference