python - Cython 没有速度提升

Question

我正在尝试定义一个包含用于模拟积分的内部循环的函数。

问题是速度。在我的机器上评估一次函数可能需要 30 秒。由于我的最终目标是最小化这个功能，所以一些额外的速度会很好。

因此，我试图让 Cython 工作，但我一定犯了一个严重的错误（可能很多！）。按照 Cython 文档，我尝试输入我的变量。这样做之后，代码和纯 Python 一样慢。这似乎很奇怪。

这是我的代码：

import numpy as np 
cimport cython
cimport numpy as np
import minuit

data = np.genfromtxt('q6data.csv', usecols = np.arange(1, 24, 1), delimiter = ',')  

cdef int ns    = 1000                 # Number of simulation draws
cdef int K     = 5                    # Number of observed characteristics, including            constant
cdef int J     = len(data[:,1])       # Number of products, including outside
cdef double tol   = 0.0001            # Inner GMM loop tolerance
nu = np.random.normal(0, 1, (6, ns))  # ns random deviates

@cython.boundscheck(False)
@cython.wraparound(False)


def S(np.ndarray[double, ndim=1] delta, double s1, double s2, double s3, double s4,  double s5, double a):
    """Computes the simulated integrals, one for each good.
    Parameters: delta is an array of length J containing mean product specific utility levels
    Returns: Numpy array with length J."""
    cdef np.ndarray[double, ndim=2] mu_ij = np.dot((data[:,2:7]*np.array([s1, s2, s3, s4, s5])), nu[1:K+1,:])
    cdef np.ndarray[double, ndim=2] mu_y  = a * np.log(np.exp(data[:,21].reshape(J,1) +  data[:,22].reshape(J,1)*nu[0,:].reshape(1, ns)) - data[:,7].reshape(J,1))
    cdef np.ndarray[double, ndim=2] V = delta.reshape(J,1) + mu_ij + mu_y
    cdef np.ndarray[double, ndim=2] exp_vi = np.exp(V)
    cdef np.ndarray[double, ndim=2] P_i = (1.0 / np.sum(exp_vi[np.where(data[:,1] == 71)], 0)) * exp_vi[np.where(data[:,1] == 71)] 
    cdef int yrs = 19
    cdef int yr
    for yr in xrange(yrs):
        P_yr = (1.0 / np.sum(exp_vi[np.where(data[:,1]== (yr + 72))], 0)) *   exp_vi[np.where(data[:,1] == (yr + 72))]
        P_i  = np.concatenate((P_i, P_yr)) 
    cdef np.ndarray[double, ndim=1] S = np.zeros(dtype = "d", shape = J)
    cdef int j
    for j in xrange(ns):
        S += P_i[:,j]
    return (1.0 / ns) * S

def d_infty(np.ndarray[double, ndim=1] x, np.ndarray[double, ndim=1] y):
    """Sup norm."""
    return np.max(np.abs(x - y)) 

def T(np.ndarray[double, ndim=1] delta_exp, double s1, double s2, double s3, double s4,  double s5, double a):
    """The contraction operator.  This function takes the parameters and the exponential
    of the starting value of delta and returns the fixed point.""" 
    cdef int iter = 0
    cdef int maxiter = 200
    cdef int i
    for i in xrange(maxiter): 
        delta1_exp = delta_exp * (data[:, 8] / S(np.log(delta_exp), s1, s2, s3, s4, s5, a))                    
        print i
        if d_infty(delta_exp, delta1_exp) < tol:                                       
            break
        delta_exp = delta1_exp
    return np.log(delta1_exp)


def Q(double s1, double s2, double s3, double s4, double s5, double a):
    """GMM objective function."""  
    cdef np.ndarray[double, ndim=1] delta0_exp = np.exp(data[:,10])                                                     
    cdef np.ndarray[double, ndim=1] delta1 = T(delta0_exp, s1, s2, s3, s4, s5, a)
    delta1[np.where(data[:,10]==0)] = np.zeros(len(np.where(data[:,10]==0)))            
    cdef np.ndarray[double, ndim=1] xi =  delta1 - (np.dot(data[:,2:7],   np.linalg.lstsq(data[:,2:7], delta1)[0]))   
    cdef np.ndarray[double, ndim=2] g_J = xi.reshape(J,1) * data[:,11:21]
    cdef np.ndarray[double, ndim=1] G_J = (1.0 / J) * np.sum(g_J, 0) 
    return np.sqrt(np.dot(G_J, G_J))

我已经分析了代码，它似乎是函数 S，积分模拟器，正在扼杀性能。无论如何，我预计输入变量至少会带来一些速度提升。由于它没有产生任何收益，我被引导相信我犯了一些根本性的错误。

有人在 Cython 代码中看到可能导致此结果的明显错误吗？

哦，由于我对编程很陌生，肯定有很多不好的风格和减慢代码速度的事情。如果你有时间，也请随时告诉我这些观点。

score 34 · Accepted Answer

Cython 可以生成一个 html 文件来帮助解决这个问题：

cython -a MODULE.py

这显示了每行源代码通过各种黄色阴影着色为白色。黄色越深，该行上仍在执行的 Python 行为就越动态。对于包含一些黄色的每一行，您需要添加更多静态类型声明。

当我这样做时，我喜欢将我遇到问题的部分源代码拆分为许多单独的行，每个表达式或运算符一个，以获得最精细的视图。

没有这个，很容易忽略变量、函数调用或运算符的一些静态类型声明。（例如，索引运算符 x[y] 仍然是一个完全动态的 Python 操作，除非您另有声明）

score 17 · Accepted Answer

Cython 不提供自动性能提升，您必须了解其内部结构并检查生成的 C 代码。

特别是如果你想提高循环性能，你必须避免在其中调用 Python 函数，在这种情况下你碰巧做了很多事情（所有的np.调用都是 Python 调用、切片，可能还有其他事情）。

有关使用 Cython 进行性能优化的一般指南，请参阅此页面（优化时 -a 开关确实很方便），以及优化 numpy 代码时的特殊性。

score 11 · Accepted Answer

您绝对可以通过使用更多 Numpy 的功能来加速您的代码。

例如：

cdef np.ndarray[double, ndim=1] S = np.zeros(dtype = "d", shape = J)
cdef int j
for j in xrange(ns):
    S += P_i[:,j]

会更快更清晰，因为

S = P_i.sum(axis=1)

您还重复了一些计算，因此花费的时间是必要的两倍。例如

np.where(data[:,1]==(yr + 72))

只能计算一次并将其存储在可以重复使用的变量中。

您还执行了大量的整形和切片：从一开始就让您的变量采用更简单的格式可能会有所帮助。如果可能的话，你的代码会更清晰，优化也会更明显。

score 6 · Accepted Answer

分析器会帮助您找出哪个部分慢吗？我喜欢使用标准库分析器运行程序：

python -O -m cProfile -o profile.out MYAPP.py

然后在“RunSnakeRun”GUI 中查看输出：

runsnake profile.out

可以从这里安装 RunSnakeRun： http ://www.vrplumber.com/programming/runsnakerun/

RunSnakeRun 截图

score 1 · Accepted Answer

在您于 11 月 28 日发表评论后，仅拆分数据：

import sys
import time
import numpy as np

def splitdata( data, n, start=1971 ):
    """ split data into n pieces, where col 1 == start .. start + n """
        # not fancy, fast enough for small n
    split = n * [None]
    for j in range(n):
        split[j] = data[ data[:,1] == start + j ]
    return split  # [ arrays: col1 0, col1 1 ... ]

#...........................................................................
N = 2237
ncol = 21
start = 1971
n = 20
seed = 1
exec "\n".join( sys.argv[1:] )  # run this.py N= ...
np.set_printoptions( 2, threshold=100, suppress=True )  # .2f
np.random.seed(seed)

print "N=%d  ncol=%d  n=%d" % (N, ncol, n)
data = np.random.uniform( start, start + n, (N,ncol) )
data[:,1] = data[:,1].round()
t0 = time.time()

split = splitdata( data, n, start )  # n pieces

print "time: %.0f us splitdata" % ((time.time() - t0) * 1e6)
for y, yeardata in enumerate(split):
    print "%d  %d  %g" % (start + y, len(yeardata), yeardata[:,0].sum())

-->

time: 27632 us splitdata  # old mac ppc
1971  69  136638
1972  138  273292
...

score 1 · Accepted Answer

接受这里给出的建议，我花了更多时间分析上面的代码。为了希望能清理一下我定义的东西

我已经对代码进行了更多的分析，并且更好地了解了哪些部分是最慢的。我另外定义了

X = data[:, 2:7]
m_y = data[:, 21].reshape(J,1)
sigma_y = 1.0
price = data[:, 7].reshape(J, 1)
shares_data = data[:,8]

然后是以下行占用了大部分时间。

mu_ij = np.dot((X*np.array([s1, s2, s3, s4, s5])), nu[1:K+1,:])
mu_y  = a * np.log(np.exp(m_y + sigma_y*nu[0,:].reshape(1,ns)) - price)
V = delta.reshape(J,1) + mu_ij + mu_y
exp_vi = np.exp(V)
P_i = (1.0 / np.sum(exp_vi[np.where(data[:,1]==71)], 0)) *  exp_vi[np.where(data[:,1]==71)] 
for yr in xarange(19):
    P_yr = (1.0 / np.sum(exp_vi[np.where(data[:,1]==yr)], 0)) * exp_vi[np.where(data[:,1]==yr)]
P_i  = np.concatenate((P_i, P_yr))

我的印象是，这是实现我的目标的一种过于繁琐的方式。我希望有人能够就如何加快这些线路提供一些建议。也许我缺少 Numpy 功能？如果这个问题没有很好地说明对您有帮助，我很乐意提供有关我的问题背景的更多详细信息。谢谢！

score -7 · Accepted Answer

“基本错误”是您期望 Python 在长循环中具有良好的性能。它是一种解释性语言，在实现和 ctyping 之间切换对此没有任何作用。有一些用于快速计算的数字 Python 库，大部分是用 C 编写的。例如，如果您已经使用numpy数组，为什么不更进一步并使用scipy您的高级数学呢？它将提高可读性和速度。

python - Cython 没有速度提升

7 回答 7

Related

Reference