python - Fast evaluation of products

Question

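I want to compute the minimum distance from a large number of points to a set of predefined locations (the problem is not spatial, and the dimensionality is > 60). I have been using cKDTree for this, but to avoid depending on scipy I was wondering whether there is a clever way to compute it with numpy arrays. Looping is easy: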

point_locs = # Shape n_dims * n_samples
test_points = # Shape n_dims * n_points
min_dist = np.zeros ( n_points )
for i in n_points:
   min_dist[i] = np.sum(( point_locs - test_points[:,i])**2,axis=1).argmin()

Is there anything faster than this? Typically, n_points is around 10^5-10^7.

score 3 · Accepted Answer

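From the documentation for scipy.spatial.KDTree:

For large dimensions (20 is already large) do not expect this to run significantly faster than brute force. High-dimensional nearest-neighbor queries are a substantial open problem in computer science.

So not only is this an open problem in computer science, but your brute-force approach is quite possibly an adequate second-best choice. If you can exploit some known structure in your data, e.g. all points belong to one of n known regions of space, then you can break the problem up.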
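
As a rough illustration of the brute-force baseline, here is a minimal sketch of a working per-point loop. The (n_samples, n_dims) array layout, the array sizes, and the function name `min_dist_loop` are assumptions made for this example; the question does not fix them.

```python
import numpy as np

def min_dist_loop(point_locs, test_points):
    """Brute-force minimum Euclidean distance from each test point to point_locs.

    point_locs:  (n_samples, n_dims) array of predefined locations
    test_points: (n_points, n_dims) array of query points
    """
    n_points = test_points.shape[0]
    min_dist = np.empty(n_points)
    for i in range(n_points):
        # Squared distances from test point i to every predefined location
        sq = np.sum((point_locs - test_points[i]) ** 2, axis=1)
        min_dist[i] = np.sqrt(sq.min())
    return min_dist

# Hypothetical sizes, chosen only to keep the example quick
rng = np.random.default_rng(0)
point_locs = rng.random((100, 60))
test_points = rng.random((1000, 60))
min_dist = min_dist_loop(point_locs, test_points)
```

Each iteration only allocates an (n_samples, n_dims) temporary, so memory stays small, at the cost of one Python-level loop step per test point.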

score 1 · Accepted Answer

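Your code isn't valid Python, and I think you have your shapes mixed up... Beyond that, if you have enough memory, you can get rid of the loop by vectorizing the distance calculation with broadcasting. If you have this data: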

import numpy as np

n_set_points = 100
n_test_points = 10000
n_dims = 60

set_points = np.random.rand(n_set_points, n_dims)
test_points = np.random.rand(n_test_points, n_dims)

Then this is the most straightforward calculation:

# deltas.shape = (n_set_points, n_test_points, n_dims)
deltas = (set_points[:, np.newaxis, :] -
          test_points[np.newaxis, ...])

# dist[j, k] holds the squared distance between the
# j-th set_point and the k-th test_point
dist = np.sum(deltas*deltas, axis=-1)

# nearest[k] is the index of the set_point closest to
# the k-th test_point; nearest has shape (n_test_points,)
nearest = np.argmin(dist, axis=0)

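The deal breaker is whether you can hold deltas in memory: it can be a huge array. If you can, you may get some extra performance from this more cryptic, but more efficient, distance calculation: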

dist = np.einsum('jkd,jkd->jk', deltas, deltas)
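
To judge whether deltas will fit in memory, its size can be estimated before allocating it. The sketch below assumes float64 data and the example sizes used above; it computes that footprint and then checks, on a deliberately small array, that the einsum form agrees with the np.sum form.

```python
import numpy as np

n_set_points, n_test_points, n_dims = 100, 10000, 60

# deltas has shape (n_set_points, n_test_points, n_dims); float64 is 8 bytes
itemsize = np.dtype(np.float64).itemsize
deltas_bytes = n_set_points * n_test_points * n_dims * itemsize
print(f"deltas would need about {deltas_bytes / 1e6:.0f} MB")

# Check on a small problem that both distance computations agree
rng = np.random.default_rng(0)
small_set = rng.random((5, n_dims))
small_test = rng.random((7, n_dims))
deltas = small_set[:, np.newaxis, :] - small_test[np.newaxis, ...]
dist_sum = np.sum(deltas * deltas, axis=-1)
dist_einsum = np.einsum('jkd,jkd->jk', deltas, deltas)
same = np.allclose(dist_sum, dist_einsum)
```

For the example sizes this comes to 480 MB, and the footprint grows linearly with the number of test points.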

If deltas is too large, split your test_points into manageable chunks and loop over the chunks, e.g.:

def nearest_neighbor(set_pts, test_pts, chunk_size):
    n_test_points = len(test_pts)
    # ret[k] will hold the index of the set point nearest to test point k
    ret = np.empty(n_test_points, dtype=np.intp)

    # range replaces the Python 2 xrange
    for chunk_start in range(0, n_test_points, chunk_size):
        deltas = (set_pts[:, np.newaxis, :] -
                  test_pts[np.newaxis,
                           chunk_start:chunk_start + chunk_size, :])
        dist = np.einsum('jkd,jkd->jk', deltas, deltas)
        ret[chunk_start:chunk_start + chunk_size] = np.argmin(dist, axis=0)
    return ret
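
The function returns indices rather than distances. Since the question asks for minimum distances, one possible way to recover them from the indices is sketched below. The helper name `min_distances` and the small random inputs are assumptions for illustration only.

```python
import numpy as np

def min_distances(set_pts, test_pts, nearest):
    """Euclidean distance from each test point to its nearest set point.

    nearest[k] is the index into set_pts of the point closest to test_pts[k].
    """
    # Fancy indexing pairs each test point with its nearest set point,
    # so the temporary array is only (n_test_points, n_dims).
    diffs = test_pts - set_pts[nearest]
    return np.sqrt(np.einsum('kd,kd->k', diffs, diffs))

# Hypothetical small inputs
rng = np.random.default_rng(0)
set_pts = rng.random((100, 60))
test_pts = rng.random((500, 60))

# Nearest indices via the direct broadcast computation (small enough here)
deltas = set_pts[:, np.newaxis, :] - test_pts[np.newaxis, ...]
nearest = np.argmin(np.einsum('jkd,jkd->jk', deltas, deltas), axis=0)

min_dist = min_distances(set_pts, test_pts, nearest)
```
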

%timeit nearest_neighbor(set_points, test_points, 1)
1 loops, best of 3: 283 ms per loop

%timeit nearest_neighbor(set_points, test_points, 10)
1 loops, best of 3: 175 ms per loop

%timeit nearest_neighbor(set_points, test_points, 100)
1 loops, best of 3: 384 ms per loop

%timeit nearest_neighbor(set_points, test_points, 1000)
1 loops, best of 3: 365 ms per loop

%timeit nearest_neighbor(set_points, test_points, 10000)
1 loops, best of 3: 374 ms per loop

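So partial vectorization gains some performance: in these timings a chunk size of 10 was fastest (175 ms), compared with 283 ms for chunks of 1 and roughly 365-384 ms for chunks of 100 or more.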
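
The timings above use IPython's %timeit magic. To repeat the comparison in a plain script, a rough sketch with the standard-library timeit module could look like the following. It redefines nearest_neighbor so the block is self-contained, and the reduced problem size is an assumption chosen to keep the run short. Absolute numbers, and possibly the best chunk size, will differ from machine to machine.

```python
import timeit
import numpy as np

def nearest_neighbor(set_pts, test_pts, chunk_size):
    n_test_points = len(test_pts)
    ret = np.empty(n_test_points, dtype=np.intp)
    for start in range(0, n_test_points, chunk_size):
        stop = start + chunk_size
        deltas = set_pts[:, np.newaxis, :] - test_pts[np.newaxis, start:stop, :]
        dist = np.einsum('jkd,jkd->jk', deltas, deltas)
        ret[start:stop] = np.argmin(dist, axis=0)
    return ret

rng = np.random.default_rng(0)
set_points = rng.random((100, 60))
test_points = rng.random((2000, 60))  # assumed size, smaller than the 10000 used above

timings = {}
for chunk_size in (1, 10, 100, 1000):
    # Best of 3 single runs, mirroring "best of 3" in the %timeit output
    best = min(timeit.repeat(
        lambda: nearest_neighbor(set_points, test_points, chunk_size),
        number=1, repeat=3))
    timings[chunk_size] = best
    print(f"chunk_size={chunk_size:5d}: {best * 1e3:.1f} ms")
```
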
