python - 在 Cython 中优化代码的技巧

Question

我有一个相对简单的问题（我认为）。我正在研究一段 Cython 代码，该代码在给定应变和特定方向时计算应变椭圆的半径（即在一定量的应变下平行于给定方向的半径）。在每个程序运行期间，这个函数被调用了几百万次，分析表明这个函数是性能方面的限制因素。这是代码：

# importing math functions from a C-library (faster than numpy)
from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI

cdef class funcs:

    cdef inline double get_r(self, double g, double omega):
        # amount of strain: g, angle: omega
        cdef double l1, l2, A, r, g2, gs   # defining some variables
        if g == 0: return 1   # no strain means the strain ellipse is a circle
        omega = omega*M_PI/180   # converting angle omega to radians
        g2 = g*g
        gs = g*sqrt(4 + g2)
        l1 = 0.5*(2 + g2 + gs)   # l1 and l2: eigenvalues of the Cauchy strain tensor
        l2 = 0.5*(2 + g2 - gs)
        A = acos(g/sqrt(g2 + (1 - l2)**2))   # orientation of the long axis of the ellipse
        r = 1./sqrt(sqrt(l2)*(cos(omega - A)**2) + sqrt(l1)*(sin(omega - A)**2))   # the radius parallel to omega
        return r   # return of the jedi

每次调用运行这段代码大约需要 0.18 微秒，我认为对于这样一个简单的函数来说有点长。另外，math.h有一个 square(x) 函数，但我不能从libc.math库中导入它，有人知道怎么做吗？还有什么其他建议可以进一步提高这段代码的性能吗？

更新 2013/09/04：

似乎有更多的东西在起作用而不是看起来。当我分析一个调用get_r1000 万次的函数时，我得到的性能与调用另一个函数不同。我添加了部分代码的更新版本。当我get_r_profile用于分析时，每次调用 0.073 微秒get_r，而每次调用MC_criterion_profile0.164 微秒get_r，相差 50%，这似乎与return r.

from libc.math cimport sin, cos, acos, exp, sqrt, fabs, M_PI

cdef class thesis_funcs:

    cdef inline double get_r(self, double g, double omega):
        cdef double l1, l2, A, r, g2, gs, cos_oa2, sin_oa2
        if g == 0: return 1
        omega = omega*SCALEDPI
        g2 = g*g
        gs = g*sqrt(4 + g2)
        l1 = 0.5*(2 + g2 + gs)
        l2 = l1 - gs
        A = acos(g/sqrt(g2 + square(1 - l2)))
        cos_oa2 = square(cos(omega - A))
        sin_oa2 = 1 - cos_oa2
        r = 1.0/sqrt(sqrt(l2)*cos_oa2 + sqrt(l1)*sin_oa2)
        return r

    @cython.profile(False)
    cdef inline double get_mu(self, double r, double mu0, double mu1):
        return mu0*exp(-mu1*(r - 1))

    def get_r_profile(self): # Profiling through this guy gives me 0.073 microsec/call
        cdef unsigned int i
        for i from 0 <= i < 10000000:
            self.get_r(3.0, 165)

    def MC_criterion(self, double g, double omega, double mu0, double mu1, double C = 0.0):
        cdef double r, mu, theta, res
        r = self.get_r(g, omega)
        mu = self.get_mu(r, mu0, mu1)
        theta = 45 - omega
        theta = theta*SCALEDPI
        res = fabs(g*sin(2.0*theta)) - mu*(1 + g*cos(2.0*theta)) - C
        return res

    def MC_criterion_profile(self): # Profiling through this one gives 0.164 microsec/call
        cdef double g, omega, mu0, mu1
        cdef unsigned int i
        omega = 165
        mu0 = 0.6
        mu1 = 2.0
        g = 3.0
        for i from 1 <= i < 10000000:
            self.MC_criterion(g, omega, mu0, mu1)

我认为两者之间可能存在根本差异，get_r_profile并且MC_criterion会导致额外的间接费用。你能发现吗？

score 5 · Accepted Answer

根据您的评论，线计算r是最昂贵的。如果是这种情况，那么我怀疑是 trig 函数调用正在扼杀性能。

通过毕达哥拉斯，cos(x)**2 + sin(x)**2 == 1您可以通过计算跳过其中一个调用

cos_oa2 = cos(omega - A)**2
sin_oa2 = 1 - cos_oa2
r = 1. / sqrt(sqrt(l2) * cos_oa2 + sqrt(l1) * sin_oa2)

（或者也许翻转它们：在我的机器上，sin似乎比cos. 快。不过可能是 NumPy 故障。）

score 2 · Accepted Answer

的输出

cython -a

表明除以 0 被测试。如果您 200% 确定它不会发生，您可能希望删除此检查。

要使用 C 部门，您可以将以下指令添加到文件顶部：

# cython: cdivision=True

我会链接官方文档，但我现在无法访问它。您在这里有一些信息（p15）：https ://python.g-node.org/python-summerschool-2011/_media/materials/cython/cython-slides.pdf

score 1 · Accepted Answer

这个答案与 Cython 无关，但应该提到一些可能有帮助的点。

在真正知道是否需要它们之前定义变量可能并不理想。在“if g == 0”语句之后移动“cdef double l1, l2, A, r, g2, gs”。

我会确保从“omega = omega*M_PI/180”开始，M_PI/180 部分只计算一次。例如一些 Python 代码：

import timeit
from math import pi

def calc1( omega ):
    return omega * pi / 180

SCALEDPI = pi / 180
def calc2( omega ):
    return omega * SCALEDPI

if __name__ == '__main__':
    took = timeit.repeat( stmt = lambda:calc1( 5 ), number = 10000 )
    print "Min. time: %.4f Max. time: %.4f" % ( min( took ), max( took ) )
    took = timeit.repeat( stmt = lambda:calc2( 5 ), number = 10000 )
    print "Min. time: %.4f Max. time: %.4f" % ( min( took ), max( took ) )

calc1：最小。时间：0.0033 最大。时间：0.0034

calc2：最小。时间：0.0029 最大。时间：0.0029

尝试自己优化计算。它们看起来相当复杂，我觉得它们可以简化。

python - 在 Cython 中优化代码的技巧

3 回答 3

Related

Reference