8

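I'd like to compute the indexed weight sum across a large (1,000,000 x 3,000) boolean numpy array. The large boolean array changes infrequently, but the weights arrive at query time, and I need answers very fast, without copying the whole large array or expanding the small weights array to the size of the large array.

The result should be an array with 1,000,000 entries, each holding the sum of the weights array entries that correspond to that row's True values.

I looked into using masked arrays, but they seem to require building a weights array the size of my large boolean array.

The code below gives the correct results, but I can't afford the copy made in the multiply step. The multiply isn't even necessary, since the values array is boolean, but at least it handles the broadcasting properly.

I'm new to numpy and loving it, but I'm about to give up on it for this particular problem. I've learned enough numpy to know to stay away from anything that loops in Python.

My next step will be to write this routine in C (which, by the way, has the added benefit of letting me save memory by using bits instead of bytes).

Unless one of you numpy gurus can save me from cython?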

import numpy  # needed for the numpy.ma call at the end
from numpy import array, multiply, sum

# Construct an example values array, alternating True and False.
# This represents four records of three attributes each:
#    array([[False,  True, False],
#           [ True, False,  True],
#           [False,  True, False],
#           [ True, False,  True]], dtype=bool)
values = array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))

# Construct example weights, one for each attribute:
#    array([1, 2, 3])
weights = array(range(1, 4))

# Create expensive NEW array with the weights for the True attributes.
# Broadcast the weights array into the values array.
#    array([[0, 2, 0],
#           [1, 0, 3],
#           [0, 2, 0],
#           [1, 0, 3]])
weighted = multiply(values, weights)

# Add up the weights:
#    array([2, 4, 2, 4])
answers = sum(weighted, axis=1)

print(answers)

# Rejected masked_array solution is too expensive (and oddly inverts
# the results):
masked = numpy.ma.array([[1,2,3]] * 4, mask=values)
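
The bit-packing idea mentioned in the question does not strictly need C. Here is a rough sketch, assuming `numpy.packbits` and `numpy.unpackbits` are acceptable and that unpacking one block of rows at a time is fast enough (this has not been benchmarked). The helper names `pack_rows` and `weighted_sums_from_packed` and the `chunk_rows` default are hypothetical, not part of the original code:

```python
import numpy as np

def pack_rows(values):
    """Pack a 2-D boolean array into bits along each row (8 columns per byte)."""
    return np.packbits(values, axis=1)

def weighted_sums_from_packed(packed, n_cols, weights, chunk_rows=100_000):
    """Row-wise weighted sums from a bit-packed matrix, one block of rows at a time."""
    out = np.empty(packed.shape[0], dtype=np.result_type(np.uint8, weights.dtype))
    for start in range(0, packed.shape[0], chunk_rows):
        stop = start + chunk_rows
        # Unpack only this block; drop the zero padding added by packbits.
        block = np.unpackbits(packed[start:stop], axis=1)[:, :n_cols]
        out[start:stop] = block.dot(weights)
    return out

values = np.array([x % 2 for x in range(12)], dtype=bool).reshape((4, 3))
weights = np.array(range(1, 4))

packed = pack_rows(values)
answers = weighted_sums_from_packed(packed, values.shape[1], weights)
print(answers)
```

The packed array takes roughly one eighth of the memory of the boolean array; the price is the unpacking work at query time.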
4

4 Answers

4

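The dot product (or inner product) is what you want. It takes a matrix of size m×n and a vector of length n and multiplies them together, yielding a vector of length m in which each entry is the weighted sum of one row of the matrix, with the entries of the vector as the weights.

Numpy implements this as array1.dot(array2) (or numpy.dot(array1, array2) in older versions). For example: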

from numpy import array

values = array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))

weights = array(range(1, 4))

answers = values.dot(weights)
print(answers)
# output: [2 4 2 4]

(You should benchmark this with the timeit module, though.)
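
One caveat worth checking: with a boolean matrix and integer or float weights, I believe `dot` may first convert the matrix to the weights' dtype, which would allocate a temporary larger than the boolean array itself. That is an assumption about numpy internals, so measure it on your version. If it turns out to matter, a minimal sketch is to apply `dot` to blocks of rows so that any temporary stays bounded. The helper name `weighted_row_sums` and the `chunk_rows` default are hypothetical:

```python
import numpy as np

def weighted_row_sums(values, weights, chunk_rows=100_000):
    """Row-wise weighted sum of a boolean matrix, computed block by block."""
    out = np.empty(values.shape[0], dtype=np.result_type(values.dtype, weights.dtype))
    for start in range(0, values.shape[0], chunk_rows):
        stop = start + chunk_rows
        # Row slices are views, so only this block can be converted or copied.
        out[start:stop] = values[start:stop].dot(weights)
    return out

values = np.array([x % 2 for x in range(12)], dtype=bool).reshape((4, 3))
weights = np.array(range(1, 4))

answers = weighted_row_sums(values, weights, chunk_rows=2)
print(answers)
```

To compare it against the plain call, something like `timeit.timeit(lambda: weighted_row_sums(values, weights), number=3)` on realistically sized arrays should do.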

answered 2012-04-19T01:17:31.310
3

It seems likely that dbaupp's answer is the correct one. But just for the sake of diversity, here's another solution that saves memory. This will work even for operations that don't have a built-in numpy equivalent.

>>> values = numpy.array([(x % 2) for x in range(12)], dtype=bool).reshape((4,3))
>>> weights = numpy.array(range(1, 4))
>>> weights_stretched = numpy.lib.stride_tricks.as_strided(weights, (4, 3), (0, weights.itemsize))

numpy.lib.stride_tricks.as_strided is a wonderful little function! It allows you to specify shape and strides values that allow a small array to mimic a much larger array. Observe -- there aren't really four rows here; it just looks that way:

>>> weights_stretched[0][0] = 4
>>> weights_stretched 
array([[4, 2, 3],
       [4, 2, 3],
       [4, 2, 3],
       [4, 2, 3]])

So instead of passing a huge array to MaskedArray, you can pass a smaller one. (But as you've already noticed, numpy masking works in the opposite way you might expect; truth masks, rather than revealing, so you'll have to store your values inverted.) As you can see, MaskedArray doesn't copy any data; it just reflects whatever is in weights_stretched:

>>> masked = numpy.ma.MaskedArray(weights_stretched, numpy.logical_not(values))
>>> weights_stretched[0][0] = 1
>>> masked
masked_array(data =
 [[-- 2 --]
 [1 -- 3]
 [-- 2 --]
 [1 -- 3]],
      mask =
 [[ True False  True]
 [False  True False]
 [ True False  True]
 [False  True False]],
      fill_value=999999)

Now we can just pass it to sum:

>>> numpy.sum(masked, axis=1)
masked_array(data = [2 4 2 4],
      mask = [False False False False],
      fill_value=999999)

I benchmarked numpy.dot and the above against a 1,000,000 x 30 array. This is the result on a relatively modern MacBook Pro (numpy.dot is dot1; mine is dot2):

>>> %timeit dot1(values, weights)
1 loops, best of 3: 194 ms per loop
>>> %timeit dot2(values, weights)
1 loops, best of 3: 459 ms per loop

As you can see, the built-in numpy solution is faster. But stride_tricks is worth knowing about regardless, so I'm leaving this.
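
For the specific case of repeating one row, `numpy.broadcast_to` (available in newer numpy releases than this answer was written against) gives the same zero-stride view without hand-written strides. A minimal sketch, swapping it in for `as_strided`; note the view is read-only, so the in-place assignment shown earlier would not work on it:

```python
import numpy as np

values = np.array([x % 2 for x in range(12)], dtype=bool).reshape((4, 3))
weights = np.array(range(1, 4))

# Read-only view that repeats `weights` for every row without copying it.
weights_stretched = np.broadcast_to(weights, values.shape)

# MaskedArray hides entries where the mask is True, so invert `values`.
masked = np.ma.MaskedArray(weights_stretched, mask=~values)

# filled(0) turns the masked result into a plain ndarray.
row_sums = masked.sum(axis=1).filled(0)
print(row_sums)  # [2 4 2 4]
```

The row stride of `weights_stretched` is 0, which is what makes every row alias the same three numbers.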

answered 2012-04-19T02:07:20.370
1

Would this work for you?

a = np.array([sum(row * weights) for row in values])

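This uses sum() to add up the row * weights values immediately, so you don't need memory to store all the intermediate values. The list comprehension then collects the per-row results.

You said you want to avoid anything that "loops in Python". This at least does the looping in Python's C internals rather than in an explicit Python loop, but it can't be as fast as a pure NumPy solution, which runs in compiled C or Fortran.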
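
Since each row is boolean, the multiplication in the comprehension can also be replaced by boolean indexing. This is only a sketch of that variation and has not been benchmarked; it is still a Python-level loop over the rows:

```python
import numpy as np

values = np.array([x % 2 for x in range(12)], dtype=bool).reshape((4, 3))
weights = np.array(range(1, 4))

# weights[row] selects only the weights whose column is True in that row.
a = np.array([weights[row].sum() for row in values])
print(a)  # [2 4 2 4]
```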

answered 2012-04-19T01:16:47.740
0

I don't think you need numpy for something like this. And 1000000 x 3000 is a huge array; it will most likely not fit in your RAM.
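
As a back-of-the-envelope check on the size, here is the arithmetic, using the fact that numpy stores a boolean as one byte. The bit-packed figure is a hypothetical alternative layout, included only for comparison:

```python
rows, cols = 1_000_000, 3_000

# One byte per element for a numpy bool array.
bytes_as_bool = rows * cols

# Hypothetical bit-packed layout: each row rounded up to whole bytes.
bytes_packed = rows * ((cols + 7) // 8)

print(bytes_as_bool / 2**30)  # size in GiB, about 2.8
print(bytes_packed / 2**20)   # size in MiB, about 358
```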

I would do it this way:

Let's say your data is originally in a text file:

False,True,False
True,False,True
False,True,False
True,False,True

My code:

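weight = range(1, 4)
dicto = {'True': 1, 'False': 0}  # map the text tokens to 0/1

with open('my_data.txt') as fin:
    # Sum the weights of the True columns over every line of the file.
    a = sum(sum(dicto[ele] * w for ele, w in zip(line.strip().split(','), weight)) for line in fin)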

Result:

>>> a
12

Edit:

I think I slightly misread the question the first time and summed everything together. Here is a version that gives exactly what the OP is after:

weight = range(1, 4)
dicto = {'True': 1, 'False': 0}  # map the text tokens to 0/1

with open('my_data.txt') as fin:
    # One weighted sum per line of the file.
    a = [sum(dicto[ele] * w for ele, w in zip(line.strip().split(','), weight)) for line in fin]

Result:

>>> a
[2, 4, 2, 4]
answered 2012-04-19T01:20:40.893