python - Fastest way to zero out low values in an array?

Question

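So, say I have 100,000 float arrays with 100 elements each. I need the highest X values, but only if they are greater than Y. Any element that does not match should be set to 0. What would be the fastest way to do this in Python? Order must be maintained. Most of the elements are already set to 0.

Sample variables: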

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

Expected result:

array = [0, .25, 0, .15, .5, 0, 0, 0, 0, 0]

score 78 · Accepted Answer

This is a typical job for NumPy, which is very fast for these kinds of operations:

import numpy

array_np = numpy.asarray(array)
low_values_flags = array_np < lowValY  # Where values are low
array_np[low_values_flags] = 0  # All low values set to 0

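Now, if you only need the highCountX largest elements, you can even "forget" the small elements (instead of setting them to 0 and sorting them) and only sort the list of large elements: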

array_np = numpy.asarray(array)
print(numpy.sort(array_np[array_np >= lowValY])[-highCountX:])

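Of course, sorting the whole array when you only need a few elements might not be optimal. Depending on your needs, you might want to consider the standard heapq module.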
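Since the question mentions 100,000 arrays of 100 elements, the same idea can be applied to all rows at once. The following is only a sketch under assumptions: the function name `zero_low_values_2d` is made up for illustration, the data is assumed to fit in one 2-D array, and `high_count_x` is assumed to be between 1 and the row length. It uses `numpy.partition` instead of a full sort, and ties at the cutoff can leave more than `high_count_x` non-zero values in a row.

```python
import numpy as np

def zero_low_values_2d(data, high_count_x, low_val_y):
    """Hypothetical helper: per row, keep the high_count_x largest values above low_val_y."""
    data = np.asarray(data, dtype=float)
    # Step 1: zero everything that is not strictly greater than the threshold.
    out = np.where(data > low_val_y, data, 0.0)
    # Step 2: per-row cutoff, i.e. the high_count_x-th largest remaining value.
    # Assumes 1 <= high_count_x <= number of columns.
    cutoff = np.partition(out, -high_count_x, axis=1)[:, -high_count_x]
    # Step 3: zero everything below the row's cutoff; positions are untouched,
    # so the original order is preserved.
    out[out < cutoff[:, None]] = 0.0
    return out

rows = [
    [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0],
    [.9, .8, .7, .6, 0, 0, 0, 0, 0, 0],
]
result = zero_low_values_2d(rows, 3, .1)
```

Whether this beats a per-array loop depends on how the arrays are stored, so it is worth timing on real data.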

score 20 · Accepted Answer


from scipy.stats import threshold
thresholded = threshold(array, 0.5)

:)

answered 2014-03-10T02:42:55.357
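As far as I know, `scipy.stats.threshold` was deprecated and later removed from SciPy, so the snippet above may not import on a current installation. A rough NumPy stand-in for the lower-bound case could look like the sketch below. The name `threshold_min` is hypothetical, and it only covers replacing values below a minimum.

```python
import numpy as np

def threshold_min(values, threshmin, newval=0.0):
    """Hypothetical stand-in: replace every value below threshmin with newval."""
    arr = np.asarray(values, dtype=float)
    # Values equal to threshmin are kept, values strictly below are replaced.
    return np.where(arr < threshmin, newval, arr)

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
thresholded = threshold_min(array, .1)
```

Note that this only handles the "greater than Y" part of the question, not the "highest X values" part.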

score 7 · Accepted Answer

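There's a special MaskedArray class in NumPy that does exactly that. You can "mask" elements based on any precondition. This represents your need better than assigning zeroes: numpy operations will ignore masked values when appropriate (for example, when finding the mean value).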

>>> from numpy import ma
>>> x = ma.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])
>>> x1 = ma.masked_inside(x, 0, 0.1) # mask everything in 0..0.1 range
>>> x1
masked_array(data = [-- 0.25 -- 0.15 0.5 -- -- -- -- --],
         mask = [ True False True False False True True True True True],
   fill_value = 1e+20)
>>> print(x1.filled(0)) # Fill with zeroes
[ 0 0.25 0 0.15 0.5 0 0 0 0 0 ]

As an added benefit, masked arrays are well supported in the matplotlib visualization library, if you need that.

Documentation on masked arrays in numpy
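To illustrate the point about operations ignoring masked values, here is a small sketch using `ma.masked_less`, which masks everything strictly below a value. The variable names are only for illustration.

```python
import numpy as np
from numpy import ma

x = np.array([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0])

# Mask every element strictly below 0.1 instead of overwriting it.
masked = ma.masked_less(x, 0.1)

# The mean only takes the unmasked values (.25, .15, .5) into account.
mean_of_kept = masked.mean()

# Convert back to a plain array with zeroes in the masked positions.
zeroed = masked.filled(0)
```

The original array `x` is left unchanged, which can be handy if the raw values are still needed later.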

score 6 · Accepted Answer

Using numpy:

# assign zero to all elements less than or equal to `lowValY`
a[a<=lowValY] = 0 
# find n-th largest element in the array (where n=highCountX)
x = partial_sort(a, highCountX, reverse=True)[:highCountX][-1]
# zero everything below the n-th largest element
a[a<x] = 0 # NOTE: it might leave more than highCountX non-zero elements
           # if there are duplicates

where partial_sort could be:

def partial_sort(a, n, reverse=False):
    #NOTE: in general it should return full list but in your case this will do
    return sorted(a, reverse=reverse)[:n]

The expression a[a<value] = 0 can be written without numpy as follows:

for i, x in enumerate(a):
    if x < value:
        a[i] = 0
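If leaving extra non-zero elements on duplicates is a problem, one possible variation is to select positions rather than compare against a cutoff value. The sketch below is an assumption-laden illustration, not the answer's method: `keep_exactly_top_x` is a made-up name, and `numpy.argpartition` decides arbitrarily which of several tied values survive.

```python
import numpy as np

def keep_exactly_top_x(values, high_count_x, low_val_y):
    """Hypothetical helper: keep at most high_count_x values above low_val_y."""
    a = np.asarray(values, dtype=float)
    out = np.zeros_like(a)
    if high_count_x <= 0:
        return out
    # Positions of all values strictly above the threshold.
    candidates = np.flatnonzero(a > low_val_y)
    if candidates.size > high_count_x:
        # argpartition puts the indices of the high_count_x largest
        # candidates in the last high_count_x slots (in no particular order).
        order = np.argpartition(a[candidates], -high_count_x)
        candidates = candidates[order[-high_count_x:]]
    # Copy the survivors back into their original positions.
    out[candidates] = a[candidates]
    return out

kept = keep_exactly_top_x([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0], 3, .1)
tied = keep_exactly_top_x([.5, .5, .5, .5], 2, .1)
```

With four equal values and `high_count_x = 2`, exactly two of them remain non-zero.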

score 5 · Accepted Answer

The simplest way would be:

topX = sorted([x for x in array if x > lowValY], reverse=True)[highCountX-1]
print([x if x >= topX else 0 for x in array])

In pieces, this selects all the elements greater than lowValY:

[x for x in array if x > lowValY]

This list contains only the elements greater than the threshold. Then it is sorted so that the largest values are at the start:

sorted(..., reverse=True)

Then a list index takes the threshold for the top highCountX elements:

sorted(...)[highCountX-1]

Finally, the original array is filled out using another list comprehension:

[x if x >= topX else 0 for x in array]

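There is a boundary condition where two or more equal elements are (in your example) the third-highest element. The resulting array will contain that element more than once.

There are other boundary conditions as well, such as when len(array) < highCountX. Handling such conditions is left to the implementer.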
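One possible way to handle the case where fewer than highCountX elements exceed the threshold is sketched below. The function name `zero_low_values` and the choice to fall back to the smallest available candidate are assumptions made for illustration. Duplicates at the cutoff are still all kept, as described above.

```python
def zero_low_values(values, high_count_x, low_val_y):
    """Hypothetical helper: the list-comprehension approach with boundary checks."""
    # Candidates above the threshold, largest first.
    candidates = sorted((x for x in values if x > low_val_y), reverse=True)
    if high_count_x <= 0 or not candidates:
        # Nothing qualifies, so everything becomes zero.
        return [0 for _ in values]
    # With fewer candidates than requested, the smallest candidate is the cutoff.
    cutoff = candidates[min(high_count_x, len(candidates)) - 1]
    return [x if x >= cutoff else 0 for x in values]

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
normal_case = zero_low_values(array, 3, .1)
too_few_candidates = zero_low_values(array, 5, .1)
```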

score 2 · Accepted Answer

Setting elements below some threshold to zero is easy:

array = [ x if x > threshold else 0.0 for x in array ]

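(Plus the occasional abs() if needed.)

The requirement of the N highest numbers is a bit vague, however. What if there are, for example, N+1 equal numbers above the threshold? Which one should be truncated?

You could sort the array first, then set the threshold to the value of the Nth element: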

# value of the N-th largest element (index N-1, since indexing starts at 0)
threshold = sorted(array, reverse=True)[N-1]
array = [ x if x >= threshold else 0.0 for x in array ]

Note: This solution is optimized for readability, not performance.
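If a deterministic answer to "which one to truncate" is wanted, one option is to let the earliest occurrence win. This is just one possible policy, sketched under the assumption that `n >= 0`. The name `keep_top_n_first_wins` is hypothetical.

```python
def keep_top_n_first_wins(values, n, threshold):
    """Hypothetical helper: keep the n largest values above threshold; on ties, earlier positions win."""
    # Rank candidate positions: largest value first, lower index first on ties.
    ranked = sorted(
        (i for i, x in enumerate(values) if x > threshold),
        key=lambda i: (-values[i], i),
    )
    keep = set(ranked[:n])
    # Rebuild the list in its original order, zeroing everything not kept.
    return [x if i in keep else 0.0 for i, x in enumerate(values)]

tie_result = keep_top_n_first_wins([.5, .2, .5, .5], 2, .1)
sample_result = keep_top_n_first_wins([.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0], 3, .1)
```

Like the code above, this favors readability over speed.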

score 1 · Accepted Answer

You could use map and lambda; it should be fast enough.

# list() is needed on Python 3, where map returns an iterator
new_array = list(map(lambda x: x if x > y else 0, array))

score 0 · Accepted Answer

Use a heap.

This works in time O(n*lg(HighCountX)).

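import heapq

array = [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

# min-heap holding the (at most) highCountX largest values above lowValY
heap = []
for x in array:
    if x <= lowValY:
        continue
    if len(heap) < highCountX:
        heapq.heappush(heap, x)
    elif heap and x > heap[0]:
        # replace the smallest kept value with the larger newcomer
        heapq.heapreplace(heap, x)

# the heap root is the smallest value to keep; with no candidates, zero everything
cutoff = heap[0] if heap else float('inf')

array = [x if x >= cutoff else 0 for x in array]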

deletemin works in O(lg(k)) on a heap, and insertion in O(lg(k)) or O(1), depending on which heap type you use.

score 0 · Accepted Answer

Using a heap is a good idea, as egon said. But you can use the heapq.nlargest function to cut down on some of the effort:

import heapq 

array =  [.06, .25, 0, .15, .5, 0, 0, 0.04, 0, 0]
highCountX = 3
lowValY = .1

threshold = max(heapq.nlargest(highCountX, array)[-1], lowValY)
array = [x if x >= threshold else 0 for x in array]
