python - 为什么连接两个字符串时 Python 比 C 快？

Question

目前我想比较 Python 和 C 在用于字符串处理时的速度。我认为 C 应该比 Python 提供更好的性能；但是，我得到了完全相反的结果。

这是C程序：

#include <unistd.h>
#include <sys/time.h>

#define L (100*1024)

char s[L+1024];
char c[2*L+1024];

double time_diff( struct timeval et, struct timeval st )
{
    return 1e-6*((et.tv_sec - st.tv_sec)*1000000 + (et.tv_usec - st.tv_usec ));
}

int foo()
{
    strcpy(c,s);
    strcat(c+L,s);
    return 0;
}

int main()
{
    struct timeval st;
    struct timeval et;
    int i;
    //printf("s:%x\nc:%x\n", s,c);

    //printf("s=%d c=%d\n", strlen(s), strlen(c));
    memset(s, '1', L);
    //printf("s=%d c=%d\n", strlen(s), strlen(c));
    foo();
    //printf("s=%d c=%d\n", strlen(s), strlen(c));
    //s[1024*100-1]=0;

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 1000; i++ ) foo();
    gettimeofday(&et,NULL);

    printf("%f\n", time_diff(et,st));
    return 0;
}

这是 Python 的：

import time

s = '1'*102400
def foo():
    c = s + s
    #assert( len(c) == 204800 )

st = time.time()
for x in xrange(1000):
    foo()
et = time.time()

print (et-st)

我得到了什么：

root@xkqeacwf:~/lab/wfaster# python cp100k.py 
0.027932882309
root@xkqeacwf:~/lab/wfaster# gcc cp100k.c
root@xkqeacwf:~/lab/wfaster# ./a.out 
0.061820

那有意义吗？还是我只是犯了一些愚蠢的错误？

score 18 · Accepted Answer

累积的评论（主要来自我）转换为答案：

如果您使用对字符串长度的了解并使用memmove()ormemcpy()代替strcpy()and会发生strcat()什么？（我注意到strcat()可以替换为strcpy()结果没有差异 - 检查时间可能会很有趣。）另外，您没有包含<string.h>（或<stdio.h>）所以您错过了<string.h>可能提供的任何优化！

Marcus：是的，memmove()它strcpy()比 Python 更快，但为什么呢？一次做memmove()一个字宽的副本吗？

是的; 在 64 位机器上，为了很好地对齐数据，它可以一次移动 64 位而不是一次移动 8 位；一台 32 位机器，一次可能是 32 位。它也~~只有一个~~更简单的测试可以在每次迭代（计数）时进行，而不是（~~'计数还是空字节'~~）'这是一个空字节'。

马库斯：但memmove()即使在我制作之后仍然运作良好L=L-13，并且sizeof(s)发出了L+1024-13。我的机器有一个sizeof(int)==4.

的代码memmove()是高度优化的汇编程序，可能是内联的（没有函数调用开销，尽管对于 100KiB 的数据，函数调用开销是最小的）。好处来自更大的移动和更简单的循环条件。

马库斯：那么 Python 也使用memmove()，还是有什么神奇的？

我没有查看 Python 源代码，但实际上可以肯定它会跟踪其字符串的长度（它们以空值结尾，但 Python 总是知道字符串的活动部分有多长）。知道长度允许 Python 使用memmove()or memcpy()（区别在于memmove()即使源和目标重叠也能正常工作；memcpy()如果它们重叠，则不必正常工作）。他们获得比memmove/memcpy可用速度更快的东西相对不太可能。

我修改了 C 代码以在我的机器（Mac OS X 10.7.4、8 GiB 1333 MHz RAM、2.3 GHz Intel Core i7、GCC 4.7.1）上为我生成更稳定的时序，并比较strcpy()和strcat()vsmemcpy()与memmove(). 请注意，我将循环计数从 1000 增加到 10000 以提高计时的稳定性，并且我将整个测试（所有三种机制）重复了 10 次。可以说，计时循环计数应该增加另一个 5-10 倍，以便计时超过一秒。

#include <stdio.h>
#include <string.h>
#include <unistd.h>
#include <sys/time.h>

#define L (100*1024)

char s[L+1024];
char c[2*L+1024];

static double time_diff( struct timeval et, struct timeval st )
{
    return 1e-6*((et.tv_sec - st.tv_sec)*1000000 + (et.tv_usec - st.tv_usec ));
}

static int foo(void)
{
    strcpy(c,s);
    strcat(c+L,s);
    return 0;
}

static int bar(void)
{
    memcpy(c + 0, s, L);
    memcpy(c + L, s, L);
    return 0;
}

static int baz(void)
{
    memmove(c + 0, s, L);
    memmove(c + L, s, L);
    return 0;
}

static void timer(void)
{
    struct timeval st;
    struct timeval et;
    int i;

    memset(s, '1', L);
    foo();

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 10000; i++ )
        foo();
    gettimeofday(&et,NULL);
    printf("foo: %f\n", time_diff(et,st));

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 10000; i++ )
        bar();
    gettimeofday(&et,NULL);
    printf("bar: %f\n", time_diff(et,st));

    gettimeofday(&st,NULL);
    for( i = 0 ; i < 10000; i++ )
        baz();
    gettimeofday(&et,NULL);
    printf("baz: %f\n", time_diff(et,st));
}

int main(void)
{
    for (int i = 0; i < 10; i++)
        timer();
    return 0;
}

编译时不会发出警告：

gcc -O3 -g -std=c99 -Wall -Wextra -Wmissing-prototypes -Wstrict-prototypes \
    -Wold-style-definition cp100k.c -o cp100k

我得到的时机是：

foo: 1.781506
bar: 0.155201
baz: 0.144501
foo: 1.276882
bar: 0.187883
baz: 0.191538
foo: 1.090962
bar: 0.179188
baz: 0.183671
foo: 1.898331
bar: 0.142374
baz: 0.140329
foo: 1.516326
bar: 0.146018
baz: 0.144458
foo: 1.245074
bar: 0.180004
baz: 0.181697
foo: 1.635782
bar: 0.136308
baz: 0.139375
foo: 1.542530
bar: 0.138344
baz: 0.136546
foo: 1.646373
bar: 0.185739
baz: 0.194672
foo: 1.284208
bar: 0.145161
baz: 0.205196

奇怪的是，如果我放弃“无警告”并省略<string.h>和<stdio.h>标题，就像在原始发布的代码中一样，我得到的时间是：

foo: 1.432378
bar: 0.123245
baz: 0.120716
foo: 1.149614
bar: 0.186661
baz: 0.204024
foo: 1.529690
bar: 0.104873
baz: 0.105964
foo: 1.356727
bar: 0.150993
baz: 0.135393
foo: 0.945457
bar: 0.173606
baz: 0.170719
foo: 1.768005
bar: 0.136830
baz: 0.124262
foo: 1.457069
bar: 0.130019
baz: 0.126566
foo: 1.084092
bar: 0.173160
baz: 0.189040
foo: 1.742892
bar: 0.120824
baz: 0.124772
foo: 1.465636
bar: 0.136625
baz: 0.139923

目测这些结果，它似乎比“更干净”的代码更快，虽然我没有对两组数据运行学生 t 检验，而且时间有很大的可变性（但我确实有像 Boinc 运行的东西8个后台进程）。在代码的早期版本中，这种效果似乎更加明显，当时它是公正的strcpy()并且strcat()经过测试。我没有解释，如果它是一个真正的效果！

mvds的跟进

由于问题已关闭，我无法正确回答。在几乎什么都不做的 Mac 上，我得到了这些时间：

（带标题）

foo: 1.694667 bar: 0.300041 baz: 0.301693
foo: 1.696361 bar: 0.305267 baz: 0.298918
foo: 1.708898 bar: 0.299006 baz: 0.299327
foo: 1.696909 bar: 0.299919 baz: 0.300499
foo: 1.696582 bar: 0.300021 baz: 0.299775

（没有标题，忽略警告）

foo: 1.185880 bar: 0.300287 baz: 0.300483
foo: 1.120522 bar: 0.299585 baz: 0.301144
foo: 1.122017 bar: 0.299476 baz: 0.299724
foo: 1.124904 bar: 0.301635 baz: 0.300230
foo: 1.120719 bar: 0.300118 baz: 0.299673

预处理器输出（-E标志）显示包含标头会转换strcpy为内置调用，例如：

((__builtin_object_size (c, 0) != (size_t) -1) ? __builtin___strcpy_chk (c, s, __builtin_object_size (c, 2 > 1)) : __inline_strcpy_chk (c, s));
((__builtin_object_size (c+(100*1024), 0) != (size_t) -1) ? __builtin___strcat_chk (c+(100*1024), s, __builtin_object_size (c+(100*1024), 2 > 1)) : __inline_strcat_chk (c+(100*1024), s));

所以 libc 版本的 strcpy 优于 gcc 内置。（使用gdb它很容易验证断点strcpy确实不会在strcpy()调用中中断，如果包含标题）

在 Linux（Debian 5.0.9，amd64）上，差异似乎可以忽略不计。生成的程序集（-S标志）仅在包含所携带的调试信息上有所不同。

score 7 · Accepted Answer

我相信这样做的原因是 Python 字符串不是以空值结尾的。

在 Python 中，字符串长度存储在字符串旁边，允许它在连接字符串时跳过 strcat() 使用的隐式 strlen()。

加上字符串连接是直接在 C 中为 Python 实现的，这可能是原因。

编辑：好吧，现在我实际上查看了 C 代码并看到它使用静态缓冲区，我也很困惑，因为我看不出 Python 如何避免动态分配，这应该慢得多......

python - 为什么连接两个字符串时 Python 比 C 快？

2 回答 2

Related

Reference