3

我已经快速检查了构建树并查询它与仅计算所有欧几里德距离的性能。如果我在这棵树上查询半径内的所有其他点,它不应该大大优于蛮力方法吗?

有谁知道为什么我的测试代码会产生这些不同的结果?我用错了吗?测试用例不适合 kd-trees 吗?

PS:这是我使用的代码的简化概念验证版本。我还存储和转换结果的完整代码可以在这里找到,但它会产生相同的结果。

进口

import numpy as np
from time import time
from scipy.spatial import KDTree as kd
from functools import reduce
import matplotlib.pyplot as plt

实现

def euclid(c, cs, r):
    return ((cs[:,0] - c[0]) ** 2 + (cs[:,1] - c[1]) ** 2 + (cs[:,2] - c[2]) ** 2) < r ** 2

def find_nn_naive(cells, radius):
    for i in range(len(cells)):
        cell = cells[i]
        cands = euclid(cell, cells, radius)

def find_nn_kd_seminaive(cells, radius):
    tree = kd(cells)
    for i in range(len(cells)):
        res = tree.query_ball_point(cells[i], radius)

def find_nn_kd_by_tree(cells, radius):
    tree = kd(cells)
    return tree.query_ball_tree(tree, radius)

测试设置

min_iter = 5000
max_iter = 10000
step_iter = 1000

rng = range(min_iter, max_iter, step_iter)
elapsed_naive = np.zeros(len(rng))
elapsed_kd_sn = np.zeros(len(rng))
elapsed_kd_tr = np.zeros(len(rng))

ei = 0
for i in rng:
    random_cells = np.random.rand(i, 3) * 400.
    t = time()
    r1 = find_nn_naive(random_cells, 50.)
    elapsed_naive[ei] = time() - t
    t = time()
    r2 = find_nn_kd_seminaive(random_cells, 50.)
    elapsed_kd_sn[ei] = time() - t
    t = time()
    r3 = find_nn_kd_by_tree(random_cells, 50.)
    elapsed_kd_tr[ei] = time() - t
    ei += 1

阴谋

plt.plot(rng, elapsed_naive, label='naive')
plt.plot(rng, elapsed_kd_sn, label='semi kd')
plt.plot(rng, elapsed_kd_tr, label='full kd')
plt.legend()
plt.show(block=True)

绘制结果

4

2 回答 2

2

如记录在scipy.spatial.KDTree()

对于大尺寸(20 已经很大),不要指望这会比蛮力运行快得多。高维最近邻查询是计算机科学中一个重要的开放问题。

(此注释也存在scipy.spatial.cKDTree(),尽管这可能是一个复制粘贴文档错误)。

我冒昧地用适当的函数重写了你的代码,这样我就可以运行一些自动化的基准测试(基于这个模板)。我还包括了一个蛮力 Numba 实现:

import numpy as np
import scipy as sp
import numba as nb

import scipy.spatial

SCALE = 400.0
RADIUS = 50.0 


def find_nn_np(points, radius=RADIUS, p=2):
    n_points, n_dim = points.shape
    result = np.empty(n_points, dtype=object)
    for i in range(n_points):
        result[i] = np.where(np.sum(np.abs(points - points[i:i + 1, :]) ** p, axis=1) < radius ** p)[0].tolist()
    return result


def find_nn_kd_tree(points, radius=RADIUS):
    tree = sp.spatial.KDTree(points)
    return tree.query_ball_point(points, radius)


def find_nn_kd_tree_cy(points, radius=RADIUS):
    tree = sp.spatial.cKDTree(points)
    return tree.query_ball_point(points, radius)


@nb.jit
def neighbors_indexes_jit(radius, center, points, p=2):
    n_points, n_dim = points.shape
    k = 0
    res_arr = np.empty(n_points, dtype=nb.int64)
    for i in range(n_points):
        dist = 0.0
        for j in range(n_dim):
            dist += abs(points[i, j] - center[j]) ** p
        if dist < radius ** p:
            res_arr[k] = i
            k += 1
    return res_arr[:k]


@nb.jit(forceobj=True, parallel=True)
def find_nn_jit(points, radius=RADIUS):
    n_points, n_dim = points.shape
    result = np.empty(n_points, dtype=object)
    for i in nb.prange(n_points):
        result[i] = neighbors_indexes_jit(radius, points[i], points, 2)
    return result

这些是我得到的基准(我省略了scipy.spatial.KDTree(),因为它与图表相去甚远,与您的发现一致):

mb_full


(为完整起见,以下是适配模板所需的代码)

def gen_input(n, dim=2, scale=SCALE):
    return scale * np.random.rand(n, dim)


def equal_output(a, b):
    return all(sorted(a_i) == sorted(b_i) for a_i, b_i in zip(a, b))


funcs = find_nn_np, find_nn_jit, find_nn_kd_tree_cy


input_sizes = tuple(int(2 ** (2 + (1 * i) / 4)) for i in range(32, 32 + 16 + 1))
print('Input Sizes:\n', input_sizes, '\n')


runtimes, input_sizes, labels, results = benchmark(
    funcs, gen_input=gen_input, equal_output=equal_output,
    input_sizes=input_sizes)


plot_benchmarks(runtimes, input_sizes, labels, units='s')
于 2019-09-19T17:11:55.977 回答
2

TL;博士:

切换到scipy.spatial.cKDTreesklearn.neighbors.KDTree获得 kd-tree 算法预期的性能。

于 2019-09-19T18:33:29.803 回答