python - Why is my python/numpy example faster than a pure C implementation?

Question

I have nearly identical code in Python and in C. Python example:

import numpy
nbr_values = 8192
n_iter = 100000

a = numpy.ones(nbr_values).astype(numpy.float32)
for i in range(n_iter):
    a = numpy.sin(a)

C example:

#include <stdio.h>
#include <math.h>
int main(void)
{
  int i, j;
  int nbr_values = 8192;
  int n_iter = 100000;
  double x;  
  for (j = 0; j < nbr_values; j++){
    x = 1;
    for (i=0; i<n_iter; i++)
    x = sin(x);
  }
  return 0;
}

When I run both examples, something strange happens:

$ time python numpy_test.py 
real    0m5.967s
user    0m5.932s
sys     0m0.012s

$ g++ sin.c
$ time ./a.out 
real    0m13.371s
user    0m13.301s
sys     0m0.008s

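It looks like python/numpy is about twice as fast as C. Is there a mistake in the experiment above? How do you explain it?

P.S. I have Ubuntu 12.04, 8 GB of RAM and a Core i5, by the way.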

score 19 · Accepted Answer

First, turn on optimization. Secondly, subtleties matter. Your C code is definitely not 'basically the same'.

Here is equivalent C code:

sinary2.c:

#include <math.h>
#include <stdlib.h>

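float *sin_array(const float *input, size_t elements)
{
    size_t i;  /* same type as elements, avoids a signed/unsigned comparison */
    /* A fresh array per call, mirroring numpy.sin returning a new array */
    float *output = malloc(sizeof(float) * elements);
    for (i = 0; i < elements; ++i) {
        output[i] = sin(input[i]);
    }
    return output;
}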

sinary.c:

#include <math.h>
#include <stdlib.h>

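/* Defined in sinary2.c; kept in a separate file so the optimizer cannot see into it */
extern float *sin_array(const float *input, size_t elements);

int main(void)
{
    int i;
    int nbr_values = 8192;
    int n_iter = 100000;
    float *x = malloc(sizeof(float) * nbr_values);
    for (i = 0; i < nbr_values; ++i) {
        x[i] = 1;
    }
    for (i = 0; i < n_iter; i++) {
        /* Allocate a new result array and discard the old one on every pass */
        float *newary = sin_array(x, nbr_values);
        free(x);
        x = newary;
    }
    free(x);
    return 0;
}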

Results:

$ time python foo.py 

real    0m5.986s
user    0m5.783s
sys     0m0.050s
$ gcc -O3 -ffast-math sinary.c sinary2.c -lm
$ time ./a.out 

real    0m5.204s
user    0m4.995s
sys     0m0.208s

The reason the program has to be split in two is to fool the optimizer a bit. Otherwise it will realize that the whole loop has no effect at all and optimize it out. Putting things in two files doesn't give the compiler visibility into the possible side-effects of sin_array when it's compiling main and so it has to assume that it actually has some and repeatedly call it.

Your original program is not at all equivalent for several reasons. One is that you have nested loops in the C version and you don't in Python. Another is that you are working with arrays of values in the Python version and not in the C version. Another is that you are creating and discarding arrays in the Python version and not in the C version. And lastly you are using float in the Python version and double in the C version.
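To make these structural differences concrete, here is a minimal Python sketch that writes both layouts side by side. The function names `numpy_style` and `c_style` are made up for illustration, and the sizes are kept tiny so the pure-Python loop finishes quickly; it is not meant as a benchmark.

```python
import math
import numpy as np

def numpy_style(nbr_values, n_iter):
    # What the question's Python code does: one pass per iteration over a
    # whole float32 array, with a new array created on every pass.
    a = np.ones(nbr_values, dtype=np.float32)
    for _ in range(n_iter):
        a = np.sin(a)
    return a

def c_style(nbr_values, n_iter):
    # What the question's C code does: a single double-precision scalar,
    # iterated n_iter times, repeated once per value. No arrays involved.
    results = []
    for _ in range(nbr_values):
        x = 1.0
        for _ in range(n_iter):
            x = math.sin(x)
        results.append(x)
    return results

small_a = numpy_style(4, 50)
small_c = c_style(4, 50)
print(small_a.dtype, float(small_a[0]), small_c[0])
```

Both versions arrive at nearly the same numbers, but the loop order, the memory traffic and the floating-point type all differ, which is why their timings are not directly comparable.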

Simply calling the sin function the appropriate number of times does not make for an equivalent test.

Also, the optimizer is a really big deal for C. Comparing C code on which the optimizer hasn't been used to anything else when you're wondering about a speed comparison is the wrong thing to do. Of course, you also need to be mindful. The C optimizer is very sophisticated and if you're testing something that really doesn't do anything, the C optimizer might well notice this fact and simply not do anything at all, resulting in a program that's ridiculously fast.

score 2 · Accepted Answer

Because "numpy" is a dedicated math library implemented for speed. C has standard functions for sin/cos, which are generally derived for accuracy.

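You are also not comparing apples with apples, since you are using double in C and float32 (float) in Python. If we change the Python code to compute in float64 instead, the time goes up by about 2.5 seconds on my machine, making it roughly match the correctly optimized C version.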
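A rough way to check the float32 versus float64 effect on your own machine is sketched below. The helper name `run` and the reduced iteration count are assumptions chosen so the script finishes quickly; the actual gap depends on the NumPy version, the CPU and the math library in use, so no particular result is implied.

```python
import timeit
import numpy as np

def run(dtype, nbr_values=8192, n_iter=200):
    # Same loop as in the question, with the element type as a parameter.
    a = np.ones(nbr_values, dtype=dtype)
    for _ in range(n_iter):
        a = np.sin(a)
    return a

for dtype in (np.float32, np.float64):
    # Time five repetitions of the reduced loop for each element type.
    seconds = timeit.timeit(lambda: run(dtype), number=5)
    print(f"{np.dtype(dtype).name}: {seconds:.3f} s")
```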

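If the whole test were made to do something more complicated that requires more control structures (if/else, do/while, etc.), then you would probably see even less difference between C and Python - because the C compiler can't really do "sin" any faster - unless you implement a better "sin" function.

Never mind the fact that your code isn't quite the same on both sides... ;)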

score 0 · Accepted Answer

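You seem to be performing the same operation 8192 x 100000 times in C, but only 100000 times in Python (I haven't used numpy before, so I may be misreading the code). Why use an array in the Python case (again, I'm not used to numpy, so perhaps the dereferencing is implicit)? If you do want to use arrays, be aware that doubles carry a performance penalty in terms of caching and optimized vectorization - you are using different types in the two implementations (float vs. double), but given the algorithm I don't think it matters.

The main reason for many of the anomalous performance benchmark questions around C vs. Pythis, Pythat is that the C implementation is often simply poor.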
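On the call-count question: `numpy.sin` is applied element by element, so a single call on an array of 8192 values performs 8192 sine evaluations. A small sketch, using the sizes from the question, shows that both programs end up doing the same number of evaluations (the variable names `total_numpy` and `total_c` are only illustrative):

```python
import numpy as np

nbr_values = 8192
n_iter = 100000

a = np.ones(nbr_values, dtype=np.float32)
b = np.sin(a)  # one Python-level call, nbr_values sine evaluations

evaluations_per_call = b.size
total_numpy = evaluations_per_call * n_iter  # n_iter calls on the whole array
total_c = nbr_values * n_iter                # nested scalar loops in the C code
print(total_numpy, total_c)
```

So the two versions do the same amount of sine work; what differs is how that work is organized.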

https://www.ibm.com/developerworks/community/blogs/jfp/entry/A_Comparison_Of_C_Julia_Python_Numba_Cython_Scipy_and_BLAS_on_LU_Factorization?lang=en

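If you look, the author there writes C that processes an array of doubles (without using the restrict or const keywords where he could have), builds with optimization, and then forces the compiler to use SIMD rather than AVX. In short, if he wanted performance, the compiler is using an inefficient instruction set for doubles and the wrong type of registers too - you can be sure that numba and numpy will use as many bells and whistles as possible, and ship with very efficient C and C++ libraries to begin with. In short, if you want speed from C you have to think about it; you may even have to disassemble the code, and perhaps disable optimization and use compiler intrinsics instead. C gives you the tools to do this, so don't expect the compiler to do all the work for you. If you want that degree of freedom, use Cython, Numba, Numpy, Scipy, etc. They

Here is a very good article on these points (I would use SciPy):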

https://www.scipy.org/scipylib/faq.html
