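Goal
I am trying to get:
- the approximate nearest neighbor library FLANN, and
- the python bindings pyflann
working on an AWS ec2 instance running Ubuntu. My goal is to compare FLANN against other ANN implementations, such as ANNOY and the scikit-learn ANN implementation, to see which one works best for the company I work for. We are dealing with millions of vectors of dimension around 500.
For this reason, it is important to me to get FLANN itself working, rather than to receive suggestions for alternative ANN implementations. I am aware of Radim Rehurek's nice blog posts, but we have a specific data set on which we want to check the performance of the various ANN algorithms, so his blog does not remove the need for us to run benchmarks on our own data.
Problem
I have successfully installed versions of flann and pyflann, but pyflann returns nonsensical results when asked to create an ANN index with the "kmeans" parameter. For example, consider the following python code and its output: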
>>> from pyflann import *
>>> from numpy import *
>>> from numpy.random import *
>>> dataset = rand(1000, 100)
>>> testset = rand(10, 100)
>>> flann = FLANN()
>>> result,dists = flann.nn(dataset,testset, 5, algorithm="kmeans")
>>> print result
[[ -278697864 32687 -278697864 32687 1677721700]
[ 40632322 6 16778074 1677721700 9]
[ 285184 1509950821 12 25600 1811940196]
[ 15 426661632 140837888 18 16801138]
[ 16779610 21 23986182 107304960 24]
[-2080373660 190447616 27 1694501978 224002059]
[ 30 1694502490 257556491 33 -2080373404]
[ 207224832 36 1509949572 49 0]
[ 43668848 0 -278698024 32687 8650760]
[ 1006080 1392509796 1397948499 208 0]]
>>>
Since the line:
result,dists = flann.nn(dataset,testset, 5, algorithm="kmeans")
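requests five neighbors for each of the ten 100-dimensional vectors in "testset", the output array has the correct dimensions: ten rows, one for each vector in "testset", each of length 5, reflecting the fact that I asked for five neighbors. However, the values of the entries cannot possibly be correct, since some are negative and many fall outside the range 0 to 999, which is the range of indices of possible nearest neighbors. For comparison, here is the output from my terminal using almost identical code, with only "kmeans" changed to "kdtree":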
>>> from pyflann import *
>>> from numpy import *
>>> from numpy.random import *
>>> dataset = rand(1000, 100)
>>> testset = rand(10, 100)
>>> flann = FLANN()
>>> result,dists = flann.nn(dataset,testset, 5, algorithm="kdtree")
>>> print result
[[189 363 397 723 685]
[400 952 892 332 477]
[560 959 295 591 394]
[596 652 250 43 448]
[498 706 543 761 323]
[334 974 591 620 766]
[435 386 58 962 421]
[234 301 189 355 191]
[857 133 420 544 612]
[978 995 439 648 627]]
>>>
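This time, all entries are non-negative integers between 0 and 999, as expected. Of course, the data is randomly generated, so the results will vary from run to run, but the "kmeans" parameter produces consistently nonsensical results while "kdtree" produces consistently plausible ones.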
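To make the "is this output plausible?" check explicit rather than eyeballing it, here is a minimal sketch of how I would test a result array. The helper names indices_valid and brute_force_nn are my own and are not part of pyflann, and the choice of squared Euclidean distance is an assumption made for illustration. The first helper only checks that every index lies in the range 0 to n-1; the second computes exact neighbors by brute force, which gives a baseline to compare any ANN result against.

```python
import numpy as np


def indices_valid(result, n_points):
    """Return True if every entry of `result` lies in [0, n_points)."""
    result = np.asarray(result)
    return bool(np.all((result >= 0) & (result < n_points)))


def brute_force_nn(dataset, testset, k):
    """Exact k nearest neighbors (hypothetical helper, not a pyflann call).

    Uses squared Euclidean distance, which is an assumption here.
    Returns (indices, squared_distances), each of shape (n_queries, k).
    """
    # Pairwise differences, shape (n_queries, n_points, n_dims).
    diffs = testset[:, None, :] - dataset[None, :, :]
    sq_dists = np.sum(diffs ** 2, axis=2)
    # Sort each row by distance and keep the k closest points.
    idx = np.argsort(sq_dists, axis=1)[:, :k]
    rows = np.arange(testset.shape[0])[:, None]
    return idx, sq_dists[rows, idx]


rng = np.random.RandomState(0)
dataset = rng.rand(1000, 100)
testset = rng.rand(10, 100)

exact_idx, exact_dists = brute_force_nn(dataset, testset, 5)

# First row of the "kmeans" output shown above.
garbage = np.array([[-278697864, 32687, -278697864, 32687, 1677721700]])

print(indices_valid(exact_idx, len(dataset)))  # True
print(indices_valid(garbage, len(dataset)))    # False
```

Brute force is fine at this size (10 queries against 1000 points), but the intermediate array grows as queries x points x dimensions, so it would need to be done in batches on the real data set.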
Software and OS details
(0) Ubuntu distribution:
Ubuntu 14.04 LTS
(1) libflann-dev:
Typing:
sudo aptitude show libflann-dev
yields:
Package: libflann-dev
State: installed
Automatically installed: no
Version: 1.8.4-3
Priority: optional
Section: universe/libdevel
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Uncompressed Size: 11.2 M
Depends: libflann1.8 (= 1.8.4-3)
Description: Fast Library for Approximate Nearest Neighbors - development
FLANN is a library for performing fast approximate nearest neighbor searches in high dimensional spaces. It contains a collection of algorithms found to work best for
nearest neighbor search and a system for automatically choosing the best algorithm and optimum parameters depending on the dataset.
This package contains development files needed to build FLANN applications.
Homepage: http://www.cs.ubc.ca/~mariusm/index.php/FLANN/FLANN
(2) Typing:
sudo aptitude show python
yields:
Package: python
State: installed
Automatically installed: no
Multi-Arch: allowed
Version: 2.7.5-5ubuntu3
Priority: optional
Section: python
Maintainer: Ubuntu Developers <ubuntu-devel-discuss@lists.ubuntu.com>
Architecture: amd64
Uncompressed Size: 687 k
Depends: python2.7 (>= 2.7.5-1~), python-minimal (= 2.7.5-5ubuntu3), libpython-stdlib (= 2.7.5-5ubuntu3)
Suggests: python-doc (= 2.7.5-5ubuntu3), python-tk (>= 2.7.5-1~)
Conflicts: python-central (< 0.5.5)
Breaks: python-bz2 (< 1.1-8), python-csv (< 1.0-4), python-email (< 2.5.5-3), update-manager-core (< 0.200.5-2)
Replaces: python-dev (< 2.6.5-2)
Provides: python-ctypes, python-email, python-importlib, python-profiler, python-wsgiref, python:any
Description: interactive high-level object-oriented language (default version)
Python, the high-level, interactive object oriented language, includes an extensive class library with lots of goodies for network programming, system administration,
sounds and graphics.
This package is a dependency package, which depends on Debian's default Python version (currently v2.7).
Homepage: http://www.python.org/
Installation method
I first tried installing FLANN with the following command:
sudo apt-get install libflann1.8
After installing pyflann with:
sudo pip install -e git+git://github.com/Captricity/pyflann.git#egg=pyflann
I typed:
python -c 'import pyflann'
and received the error message:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/mnt/working/src/pyflann/pyflann/__init__.py", line 27, in <module>
from index import *
File "/mnt/working/src/pyflann/pyflann/index.py", line 27, in <module>
from bindings.flann_ctypes import *
File "/mnt/working/src/pyflann/pyflann/bindings/__init__.py", line 30, in <module>
from flann_ctypes import *
File "/mnt/working/src/pyflann/pyflann/bindings/flann_ctypes.py", line 169, in <module>
raise ImportError('Cannot load dynamic library. Did you compile FLANN?')
ImportError: Cannot load dynamic library. Did you compile FLANN?
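As a rough diagnostic for this kind of error, the following sketch asks the standard library's ctypes which shared library, if any, the system resolves for the name "flann", and whether it can be loaded. The helper name find_flann is my own. This does not reproduce pyflann's own search logic in flann_ctypes.py, which may look in different places, so a hit here only shows that some FLANN shared library is visible to the loader.

```python
import ctypes
import ctypes.util


def find_flann(name="flann"):
    """Return (resolved library name or None, whether loading succeeded).

    Hypothetical helper: it uses only ctypes.util.find_library and
    ctypes.CDLL, and does not mirror how pyflann locates the library.
    """
    path = ctypes.util.find_library(name)
    if path is None:
        return None, False
    try:
        ctypes.CDLL(path)
        return path, True
    except OSError:
        # The name was resolved but the library could not be loaded.
        return path, False


print(find_flann())
```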
Then, on a fresh ec2 instance, I typed:
sudo apt-get install libflann-dev
sudo pip install -e git+git://github.com/Captricity/pyflann.git#egg=pyflann
and ran
python -c 'import pyflann'
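without complaint. However, I then ran into the "kmeans" problem described above.
Notes
I have successfully installed FLANN and pyflann on my MacBook Pro, and everything works fine there: even using "kmeans" as the nearest-neighbor query parameter produces sensible results.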