python - can I do fast set difference with floats using numpy if elements are equal up to some tolerance

Question

I have two lists of float numbers, and I want to calculate the set difference between them.

With numpy I originally wrote the following code:

aprows = allpoints.view([('',allpoints.dtype)]*allpoints.shape[1])
rprows = toberemovedpoints.view([('',toberemovedpoints.dtype)]*toberemovedpoints.shape[1])
diff = setdiff1d(aprows, rprows).view(allpoints.dtype).reshape(-1, 2)

This works well for things like integers. In case of 2d points with float coordinates that are the result of some geometrical calculations, there's a problem of finite precision and rounding errors causing the set difference to miss some equalities. For now I resorted to the much, much slower:

diff = []
for a in allpoints: 
    remove = False
    for p in toberemovedpoints:
        if norm(p-a) < 0.1:
            remove = True
    if not remove:
        diff.append(a)
return array(diff)

But is there a way to write this with numpy and gain back the speed?

Note that I want the remaining points to still have their full precision, so first rounding the numbers and then do a set difference probably is not the way forward (or is it? :) )

Edited to add an solution based on scipy.KDTree that seems to work:

def remove_points_fast(allpoints, toberemovedpoints):
    diff = []
    removed = 0
    # prepare a KDTree
    from scipy.spatial import KDTree
    tree = KDTree(toberemovedpoints,  leafsize=allpoints.shape[0]+1)
    for p in allpoints:
        distance, ndx = tree.query([p], k=1)
        if distance < 0.1:
            removed += 1
        else:
            diff.append(p)
    return array(diff), removed

score 1 · Accepted Answer

如果您想使用矩阵形式执行此操作，则较大的数组会消耗大量内存。如果这无关紧要，那么您可以通过以下方式获得差异矩阵：

diff_array = allpoints[:,None] - toberemovedpoints[None,:]

结果数组的行数与 allpoints 中的点数一样多，列数与 tobermovedpoints 中的点数一样多。然后你可以用任何你想要的方式来操作它（例如计算绝对值），这会给你一个布尔数组。要查找哪些行有任何命中（绝对差异 < .1），请使用numpy.any：

hits = numpy.any(numpy.abs(diff_array) < .1, axis=1)

现在你有一个向量，它的项目数与差异数组中的行数相同。您可以使用该向量来索引所有点（否定，因为我们想要不匹配的点）：

return allpoints[-hits]

这是一种麻木的方式。但是，正如我上面所说，它需要大量内存。

如果您有更大的数据，那么最好逐点进行。像这样的东西：

return allpoints[-numpy.array([numpy.any(numpy.abs(a-toberemoved) < .1) for a in allpoints ])]

这在大多数情况下应该表现良好，并且内存使用量远低于矩阵解决方案。（出于文体原因，您可能想要使用 numpy.all 而不是 numpy.any 并翻转比较以摆脱否定。）

（请注意，代码中可能存在打印错误。）

python - can I do fast set difference with floats using numpy if elements are equal up to some tolerance

1 回答 1

Related

Reference