python - Cython NumPy accumulate function

Question

I need to implement a function that sums the elements of an array over sections of variable length. So,

a = np.arange(10)
section_lengths = np.array([3, 2, 4])
out = accumulate(a, section_lengths)
print(out)
array([  3.,   7.,  26.])
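For reference, the intended behaviour can be written as a slow but simple NumPy loop. This is only a sketch to pin down the semantics, not the code from the gist; the name accumulate_reference is hypothetical, and it assumes the section lengths are non-negative and sum to at most len(a).

```python
import numpy as np

def accumulate_reference(a, section_lengths):
    """Sum consecutive sections of `a`; section i has section_lengths[i] elements."""
    out = np.zeros(len(section_lengths), dtype=np.double)
    start = 0
    for i, n in enumerate(section_lengths):
        # Each section starts where the previous one ended.
        out[i] = a[start:start + n].sum()
        start += n
    return out
```

Elements of a beyond the last section are ignored, so np.arange(10) with lengths [3, 2, 4] never touches the final element.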

I made an attempt at a Cython implementation here:

https://gist.github.com/2784725

For performance, I am comparing against a pure NumPy solution in the case where all section_lengths are equal:

LEN = 10000
b = np.ones(LEN, dtype=np.int_) * 2000
a = np.arange(np.sum(b), dtype=np.double)
out = np.zeros(LEN, dtype=np.double)

%timeit np.sum(a.reshape(-1,2000), axis=1)
10 loops, best of 3: 25.1 ms per loop

%timeit accumulate.accumulate(a, b, out)
10 loops, best of 3: 64.6 ms per loop

Do you have any suggestions for improving performance?

score 2 · Accepted Answer

Here are a few things you can try:

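In addition to the @cython.boundscheck(False) compiler directive, also try adding @cython.wraparound(False).

In your setup.py script, try adding some optimization flags:

ext_modules = [Extension("accumulate", ["accumulate.pyx"], extra_compile_args=["-O3"])]

Look at the .html file generated by cython -a accumulate.pyx to see whether any sections are missing static typing or rely heavily on Python C-API calls:

http://docs.cython.org/src/quickstart/cythonize.html#determining-where-to-add-types

Add a return statement at the end of the function. Currently it does a bunch of unnecessary error checking in your tight loop at i_el += 1.

I'm not sure it will make a difference, but I tend to declare loop counters as cdef unsigned int rather than just int.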
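Putting the build-related suggestions together, a setup.py might look roughly like the sketch below. It assumes setuptools and Cython.Build.cythonize are used (the original build script is not shown), that the source file is named accumulate.pyx, and that the compiler is GCC or Clang, since -O3 is not an MSVC flag.

```python
# setup.py (sketch under the assumptions stated above)
from setuptools import Extension, setup
from Cython.Build import cythonize
import numpy as np

ext_modules = [
    Extension(
        "accumulate",
        ["accumulate.pyx"],
        include_dirs=[np.get_include()],  # headers needed by `cimport numpy`
        extra_compile_args=["-O3"],       # GCC/Clang optimization flag
    )
]

setup(
    name="accumulate",
    ext_modules=cythonize(
        ext_modules,
        annotate=True,  # also writes accumulate.html, like `cython -a`
        # Module-wide equivalent of the two decorators:
        compiler_directives={"boundscheck": False, "wraparound": False},
    ),
)
```

Build in place with python setup.py build_ext --inplace, then open accumulate.html to see which lines still go through the Python C-API.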

You could also compare your code against numpy when the section_lengths are not all equal, since that case will probably need more than a simple sum.
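One possible pure NumPy baseline for unequal sections is np.add.reduceat, sketched below. The function name accumulate_reduceat is hypothetical, and the sketch assumes every section length is at least 1, because reduceat does not return 0 for an empty section.

```python
import numpy as np

def accumulate_reduceat(a, section_lengths):
    """Section sums via np.add.reduceat; assumes all section lengths are >= 1."""
    section_lengths = np.asarray(section_lengths)
    total = int(section_lengths.sum())
    # Start offset of each section: 0, l0, l0 + l1, ...
    starts = np.cumsum(section_lengths) - section_lengths
    # Slice to `total` so the last section does not absorb trailing elements.
    return np.add.reduceat(np.asarray(a, dtype=np.double)[:total], starts)
```

When all sections have the same length this should agree with the reshape-and-sum approach from the question, so the two can be timed against each other on the same input.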

score 1 · Accepted Answer

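Updating out[i_bas] inside the nested for loop is slow. Instead, you can accumulate into a temporary variable and write it to out[i_bas] once the inner loop has finished. The following code will be as fast as the numpy version: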

import numpy as np
cimport numpy as np

ctypedef np.int_t DTYPE_int_t
ctypedef np.double_t DTYPE_double_t

cimport cython
@cython.boundscheck(False)
@cython.wraparound(False)
def accumulate(
       np.ndarray[DTYPE_double_t, ndim=1] a not None,
       np.ndarray[DTYPE_int_t, ndim=1] section_lengths not None,
       np.ndarray[DTYPE_double_t, ndim=1] out not None,
       ):
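    cdef int i_el, i_bas, sec_length, lenout
    cdef double tmp  # local accumulator, avoids writing to out[] in the inner loop
    lenout = out.shape[0]
    i_el = 0  # running index into a
    for i_bas in range(lenout):
        tmp = 0
        # sum the next section_lengths[i_bas] elements of a
        for sec_length in range(section_lengths[i_bas]):
            tmp += a[i_el]
            i_el += 1
        out[i_bas] = tmp  # single store per section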
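Before timing the compiled function it is worth checking it against NumPy on unequal sections. The sketch below is one way to do that; check_accumulate and accumulate_py are hypothetical names, and accumulate_py is only a pure-Python stand-in with the same (a, section_lengths, out) signature so the check can run without the compiled module.

```python
import numpy as np

def accumulate_py(a, section_lengths, out):
    """Pure-Python stand-in mirroring the Cython loop; fills `out` in place."""
    i_el = 0
    for i_bas in range(out.shape[0]):
        tmp = 0.0
        for _ in range(section_lengths[i_bas]):
            tmp += a[i_el]
            i_el += 1
        out[i_bas] = tmp

def check_accumulate(func, n_sections=50, max_len=20, seed=0):
    """Return True if func(a, section_lengths, out) matches NumPy section sums."""
    rng = np.random.default_rng(seed)
    # np.int_ is intended to match the np.int_t buffer type in the Cython signature.
    lengths = rng.integers(1, max_len + 1, size=n_sections).astype(np.int_)
    a = rng.random(int(lengths.sum()))
    out = np.zeros(n_sections, dtype=np.double)
    func(a, lengths, out)
    # Split `a` at the section boundaries and sum each piece with NumPy.
    pieces = np.split(a, np.cumsum(lengths)[:-1])
    expected = np.array([p.sum() for p in pieces])
    return bool(np.allclose(out, expected))
```

With the extension built, the same check would be called as check_accumulate(accumulate.accumulate), assuming the module is importable under that name.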
