2

我遇到了对 cblas 的调用(即 daxpy: perform y += alpha * x 其中 y 和 x 是相同长度的向量,alpha 是标量)和在纯 cython 中执行的相同操作之间的性能差异。

这是我的基准,使用这个 SO 问题逐行分析 cython 代码:

%load_ext line_profiler
import line_profiler
%load_ext Cython

from Cython.Compiler.Options import directive_defaults

directive_defaults['linetrace'] = True
directive_defaults['binding'] = True

我要进行基准测试的代码段:

%%cython -a -f --compile-args=-DCYTHON_TRACE=1


import numpy as np
cimport cython
from scipy.linalg.cython_blas cimport daxpy


@cython.boundscheck(False)
def bench(double[:, ::1] G, double[::1, :] Q, double[:] x):
    cdef int T = G.shape[1]
    cdef int P = G.shape[0]

    cdef int inc = 1  # increment for blas
    cdef int p
    cdef int t
    cdef int i = 42

    for _ in range(50):  # 50 repetitions
    # First version:
        for p in range(P):
            for t in range(T):
                G[p, t] += Q[p, i] * x[t]

        # second version
        for p in range(P):
            daxpy(&T, &Q[p, i], &x[0], &inc, &G[p, 0], &inc)

基准测试的结果是:

Timer unit: 1e-06 s

Total time: 18.543 s
File: /home/mathurin/.cache/ipython/cython/_cython_magic_32a150cd3ff68e78f896ad9eb33dda69.pyx
Function: bench at line 9

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
     9                                           def bench(double[:, ::1] G, double[::1, :] Q, double[:] x):
    10         1            1      1.0      0.0      cdef int T = G.shape[1]
    11         1            0      0.0      0.0      cdef int P = G.shape[0]
    12                                               
    13         1            1      1.0      0.0      cdef int inc = 1  # increment for blas
    14                                               cdef int p
    15                                               cdef int t
    16         1            0      0.0      0.0      cdef int i = 42
    17                                               # First version:
    18         1            0      0.0      0.0      for _ in range(50):  # 50 repetitions
    19        50           22      0.4      0.0          for p in range(P):
    20      9000         2512      0.3      0.0              for t in range(T):
    21  63000000     **18002618**      0.3     97.1                  G[p, t] += Q[p, i] * x[t]
    22                                           
    23                                                   # second version
    24        50           15      0.3      0.0          for p in range(P):
    25      9000       **537865**     59.8      2.9              daxpy(&T, &Q[p, i], &x[0], &inc, &G[p, 0], &inc)

第 21 和 25 行显示 for 循环比 blas 调用慢 40 倍。如果我没记错的话,我会以正确的顺序遍历数组(以最大化缓存命中)。我希望 BLAS 更快,但不会这么快。我缺少一个明显的解释吗?

用于获取时间的片段是:

np.random.seed(2407)
G = np.random.randn(180, 7000)
Q = np.random.randn(50, 50)
Q += Q.T
Q = np.asfortranarray(Q)
x = np.random.randn(50)

import pstats, cProfile
profile = line_profiler.LineProfiler(bench)
profile.runcall(bench, G, Q, x)

profile.print_stats()
4

0 回答 0