1

I have two lists of float numbers, and I want to calculate the set difference between them.

With numpy I originally wrote the following code:

aprows = allpoints.view([('',allpoints.dtype)]*allpoints.shape[1])
rprows = toberemovedpoints.view([('',toberemovedpoints.dtype)]*toberemovedpoints.shape[1])
diff = setdiff1d(aprows, rprows).view(allpoints.dtype).reshape(-1, 2)

This works well for things like integers. In case of 2d points with float coordinates that are the result of some geometrical calculations, there's a problem of finite precision and rounding errors causing the set difference to miss some equalities. For now I resorted to the much, much slower:

diff = []
for a in allpoints: 
    remove = False
    for p in toberemovedpoints:
        if norm(p-a) < 0.1:
            remove = True
    if not remove:
        diff.append(a)
return array(diff)

But is there a way to write this with numpy and gain back the speed?

Note that I want the remaining points to still have their full precision, so first rounding the numbers and then do a set difference probably is not the way forward (or is it? :) )

Edited to add an solution based on scipy.KDTree that seems to work:

def remove_points_fast(allpoints, toberemovedpoints):
    diff = []
    removed = 0
    # prepare a KDTree
    from scipy.spatial import KDTree
    tree = KDTree(toberemovedpoints,  leafsize=allpoints.shape[0]+1)
    for p in allpoints:
        distance, ndx = tree.query([p], k=1)
        if distance < 0.1:
            removed += 1
        else:
            diff.append(p)
    return array(diff), removed
4

1 回答 1

1

如果您想使用矩阵形式执行此操作,则较大的数组会消耗大量内存。如果这无关紧要,那么您可以通过以下方式获得差异矩阵:

diff_array = allpoints[:,None] - toberemovedpoints[None,:]

结果数组的行数与 allpoints 中的点数一样多,列数与 tobermovedpoints 中的点数一样多。然后你可以用任何你想要的方式来操作它(例如计算绝对值),这会给你一个布尔数组。要查找哪些行有任何命中(绝对差异 < .1),请使用numpy.any

hits = numpy.any(numpy.abs(diff_array) < .1, axis=1)

现在你有一个向量,它的项目数与差异数组中的行数相同。您可以使用该向量来索引所有点(否定,因为我们想要不匹配的点):

return allpoints[-hits]

这是一种麻木的方式。但是,正如我上面所说,它需要大量内存。


如果您有更大的数据,那么最好逐点进行。像这样的东西:

return allpoints[-numpy.array([numpy.any(numpy.abs(a-toberemoved) < .1) for a in allpoints ])]

这在大多数情况下应该表现良好,并且内存使用量远低于矩阵解决方案。(出于文体原因,您可能想要使用 numpy.all 而不是 numpy.any 并翻转比较以摆脱否定。)

(请注意,代码中可能存在打印错误。)

于 2014-06-15T21:17:18.750 回答