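Goal

I am trying to get:

  1. the approximate nearest neighbor library FLANN, and
  2. the Python bindings pyflann

working on an AWS ec2 instance running Ubuntu. My goal is to compare FLANN with other ANN implementations, such as ANNOY and the scikit-learn ANN implementation, to see which one is best suited to the company I work for. We are dealing with millions of vectors of dimension approximately 500.

For this reason, it is important to me to get FLANN itself working, rather than to receive suggestions about alternative ANN implementations. I am aware of Radim Rehurek's nice blog post, but we have a specific data set on which we would like to check the performance of various ANN algorithms, so his blog does not remove the need for us to run benchmarks on our own data.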
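For context, one common way to score an approximate search in this kind of comparison is recall against exact neighbors. The following is only a minimal sketch: the function name `recall_at_k` and the toy index lists are made up for illustration, and it assumes the exact neighbors have already been computed by some brute-force search.

```python
def recall_at_k(approx, exact):
    """Mean fraction of the true k nearest neighbors that the approximate search returned.

    Both arguments are sequences of rows of neighbor indices, one row per query
    (nested lists or 2-D integer arrays both work).
    """
    total = 0.0
    for approx_row, exact_row in zip(approx, exact):
        truth = set(int(i) for i in exact_row)
        found = set(int(i) for i in approx_row)
        total += len(truth & found) / float(len(truth))
    return total / len(approx)


# Toy data (hypothetical): two queries, three neighbors each.
exact = [[0, 1, 2], [3, 4, 5]]
approx = [[0, 2, 9], [3, 4, 5]]
print(recall_at_k(approx, exact))
```

The first query recovers two of its three true neighbors and the second recovers all three, so the score is the mean of 2/3 and 1.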

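Problem

I have successfully installed versions of flann and pyflann, but pyflann returns nonsensical results when asked to build an ANN index with the "kmeans" argument. For example, consider the following Python code and its output: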

>>> from pyflann import *
>>> from numpy import *
>>> from numpy.random import *
>>> dataset = rand(1000, 100)
>>> testset = rand(10, 100)
>>> flann = FLANN()
>>> result,dists = flann.nn(dataset,testset, 5, algorithm="kmeans")
>>> print result
[[ -278697864       32687  -278697864       32687  1677721700]
 [   40632322           6    16778074  1677721700           9]
 [     285184  1509950821          12       25600  1811940196]
 [         15   426661632   140837888          18    16801138]
 [   16779610          21    23986182   107304960          24]
 [-2080373660   190447616          27  1694501978   224002059]
 [         30  1694502490   257556491          33 -2080373404]
 [  207224832          36  1509949572          49           0]
 [   43668848           0  -278698024       32687     8650760]
 [    1006080  1392509796  1397948499         208           0]]
>>>

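Since the line:

result,dists = flann.nn(dataset,testset, 5, algorithm="kmeans")

is asking for five neighbors for each of the ten 100-dimensional vectors in "testset", the output array has the correct dimensions: ten rows, one per vector in "testset", each of length 5, reflecting the fact that I asked for five neighbors. However, the values of the entries cannot be correct, because some are negative and many lie outside the range 0 to 999, which is the range of indices of possible nearest neighbors. For comparison, here is the output from my terminal using almost the same code as above, with only "kmeans" changed to "kdtree":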

>>> from pyflann import *
>>> from numpy import *
>>> from numpy.random import *
>>> dataset = rand(1000, 100)
>>> testset = rand(10, 100)
>>> flann = FLANN()
>>> result,dists = flann.nn(dataset,testset, 5, algorithm="kdtree")
>>> print result
[[189 363 397 723 685]
 [400 952 892 332 477]
 [560 959 295 591 394]
 [596 652 250  43 448]
 [498 706 543 761 323]
 [334 974 591 620 766]
 [435 386  58 962 421]
 [234 301 189 355 191]
 [857 133 420 544 612]
 [978 995 439 648 627]]
>>>

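This time, all of the entries are non-negative integers between 0 and 999, as expected. Of course, the data is randomly generated, so the results will vary, but the "kmeans" argument yields consistently nonsensical results, while "kdtree" yields consistently reasonable ones.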
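To make the "out of range" criterion explicit, here is a minimal sketch of the check I am applying by eye. The helper name `invalid_neighbor_entries` is made up for illustration and is not part of pyflann; the two sample rows are copied from the first row of each output above.

```python
def invalid_neighbor_entries(result, n_points):
    """Return (row, col, value) for every entry that cannot be an index into the dataset.

    `result` is a sequence of rows of neighbor indices (nested lists or a
    2-D integer array); valid indices lie in the range 0 .. n_points - 1.
    """
    bad = []
    for r, row in enumerate(result):
        for c, idx in enumerate(row):
            if not (0 <= int(idx) < n_points):
                bad.append((r, c, int(idx)))
    return bad


# First row of each output shown above, for a dataset of 1000 points.
kmeans_row = [[-278697864, 32687, -278697864, 32687, 1677721700]]
kdtree_row = [[189, 363, 397, 723, 685]]

print(len(invalid_neighbor_entries(kmeans_row, 1000)))
print(len(invalid_neighbor_entries(kdtree_row, 1000)))
```

In an interactive session the same function could be called directly on `result` with `n_points=len(dataset)`.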

Software and OS details

(0) Ubuntu release:

Ubuntu 14.04 LTS

(1) libflann-dev:

Typing:

sudo aptitude show libflann-dev

yields:

Package: libflann-dev
State: installed
Automatically installed: no
Version: 1.8.4-3
Priority: optional
Section: universe/libdevel
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Uncompressed Size: 11.2 M
Depends: libflann1.8 (= 1.8.4-3)
Description: Fast Library for Approximate Nearest Neighbors - development
 FLANN is a library for performing fast approximate nearest neighbor searches in high dimensional spaces. It contains a collection of algorithms found to work best for
 nearest neighbor search and a system for automatically choosing the best algorithm and optimum parameters depending on the dataset.

 This package contains development files needed to build FLANN applications.
Homepage: http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN

(2) Typing:

sudo aptitude show python

yields:

Package: python
State: installed
Automatically installed: no
Multi-Arch: allowed
Version: 2.7.5-5ubuntu3
Priority: optional
Section: python
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Uncompressed Size: 687 k
Depends: python2.7 (>= 2.7.5-1~), python-minimal (= 2.7.5-5ubuntu3), libpython-stdlib (= 2.7.5-5ubuntu3)
Suggests: python-doc (= 2.7.5-5ubuntu3), python-tk (>= 2.7.5-1~)
Conflicts: python-central (< 0.5.5)
Breaks: python-bz2 (< 1.1-8), python-csv (< 1.0-4), python-email (< 2.5.5-3), update-manager-core (< 0.200.5-2)
Replaces: python-dev (< 2.6.5-2)
Provides: python-ctypes, python-email, python-importlib, python-profiler, python-wsgiref, python:any
Description: interactive high-level object-oriented language (default version)
 Python, the high-level, interactive object oriented language, includes an extensive class library with lots of goodies for network programming, system administration,
 sounds and graphics.

 This package is a dependency package, which depends on Debian's default Python version (currently v2.7).
Homepage: http://www.python.org/

Installation method

I first tried to install FLANN with:

sudo apt-get install libflann1.8

After installing pyflann with:

sudo pip install -e git+git://github.com/Captricity/pyflann.git#egg=pyflann

I typed:

python -c 'import pyflann'

and received the error message:

Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "/mnt/working/src/pyflann/pyflann/__init__.py", line 27, in <module>
    from index import *
  File "/mnt/working/src/pyflann/pyflann/index.py", line 27, in <module>
    from bindings.flann_ctypes import *
  File "/mnt/working/src/pyflann/pyflann/bindings/__init__.py", line 30, in <module>
    from flann_ctypes import *
  File "/mnt/working/src/pyflann/pyflann/bindings/flann_ctypes.py", line 169, in <module>
    raise ImportError('Cannot load dynamic library. Did you compile FLANN?')
ImportError: Cannot load dynamic library. Did you compile FLANN?
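A quick way to see whether the dynamic loader can open a FLANN shared library by a given file name is to try it directly with ctypes. This is only a diagnostic sketch: the helper `can_load` is made up for illustration, the two file names are assumptions based on the package names above, and this is not necessarily how pyflann itself searches for the library.

```python
import ctypes


def can_load(soname):
    """Return True if the dynamic loader can open the named shared library."""
    try:
        ctypes.CDLL(soname)
        return True
    except OSError:
        return False


# Assumed file names: the unversioned name and a versioned name matching libflann1.8.
for name in ("libflann.so", "libflann.so.1.8"):
    print("%s: %s" % (name, can_load(name)))
```

If the versioned name loads while the unversioned one does not, that would be consistent with only the runtime package being installed, although I have not confirmed that this is the cause of the error above.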

Then, on a fresh ec2 instance, I typed:

sudo apt-get install libflann-dev
sudo pip install -e git+git://github.com/Captricity/pyflann.git#egg=pyflann

and ran:

python -c 'import pyflann'

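without complaint. However, I then have the "kmeans" problem described above.

Notes

I have successfully installed FLANN and pyflann on my MacBook Pro, and everything works there: even using "kmeans" as the nearest-neighbor query argument yields reasonable results.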
