algorithm - Best data structure for high dimensional nearest neighbor search

Question

I'm actually working on high dimensional data (~50.000-100.000 features) and nearest neighbors search must be performed on it. I know that KD-Trees has poor performance as dimensions grows, and also I've read that in general, all space-partitioning data structures tends to perform exhaustive search with high dimensional data.

Additionally, there are two important facts to be considered (ordered by relevance):

Precision: The nearest neighbors must be found (not approximations).
Speed: The search must be as fast as possible. (The time to create the data structure isn't really important).

So, I need some advice about:

The data structure to perform k-NN.
If it will be better to use an aNN (approximate nearest neighbor) approach, setting it as accurate as possible?.

score 2 · Accepted Answer

我可以在高维空间中执行 NN 搜索吗？

不会。由于维度灾难，最近邻搜索的数据结构在低维度上表现良好，但在高维度上表现不佳。事实上，查询次数变得几乎与蛮力一样，因此一文不值。

因此，在高维空间中，应该进行近似最近邻(ANN) 搜索。老实说，这是必须的。

执行人工神经网络的数据结构是什么？

我会建议 LSH，或一些 RKD 树。在我的回答中，我提到了一些在 C++ 中执行 ANN 的优秀库。但是，请注意 LSH 解决了 R-最近邻问题，因此您指定了参数 R，它实际上是半径。然后，LSH 将从查询点开始在该 R 中查找 NN，因此您不能真正请求k NN。

另一方面，RKD 树可以做到这一点并返回k NN。我有一个项目，它构建了一个 RKD 树森林并在 C++ 中执行 ANN 搜索，但它只针对高维度。它可以在 < 1 秒内处理 960 维的 10^6 图像的 GIST 数据集，其中大约 90% 的输出是真正的最近邻。名称是kd-GeRaF。它将在下个月更新为分布式版本，但它已经过测试并可以使用。它还有一个可爱的标志。:)

我也觉得您应该阅读我的答案，其中说最佳数据结构取决于数据。

score 0 · Accepted Answer

我认为在如此高维数据中进行聚类是不明智的。存在维度问题的诅咒。

随着维数的增加，距离的概念变得不那么精确，因为给定数据集中任意两点之间的距离会收敛

我建议你找到一个很好的距离度量，而不是在高维空间上直接欧几里得距离。

本页列出了一些可能的解决方案， https://en.wikipedia.org/wiki/Clustering_high-dimensional_data

2.1 子空间聚类

2.2 投影聚类

2.3 混合方法

2.4 相关聚类

algorithm - Best data structure for high dimensional nearest neighbor search

2 回答 2

Related

Reference