12

I wrote a piece of C code to show a point in a discussion about optimizations and branch prediction. Then I noticed even more diverse outcome than I did expect. My goal was to write it in a language that is common subset between C++ and C, that is standard-compliant for both languages and that is fairly portable. It was tested on different Windows PCs:

#include <stdio.h>
#include <time.h>

/// @return - time difference between start and stop in milliseconds
int ms_elapsed( clock_t start, clock_t stop )
{
    return (int)( 1000.0 * ( stop - start ) / CLOCKS_PER_SEC );
}

int const Billion = 1000000000;
/// & with numbers up to Billion gives 0, 0, 2, 2 repeating pattern 
int const Pattern_0_0_2_2 = 0x40000002; 

/// @return - half of Billion  
int unpredictableIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // true, true, false, false ...
        if ( ( i & Pattern_0_0_2_2 ) == 0 )
        {
            ++sum;
        }
    }
    return sum;
}

/// @return - half of Billion  
int noIfs()
{
    int sum = 0;
    for ( int i = 0; i < Billion; ++i )
    {
        // 1, 1, 0, 0 ...
        sum += ( i & Pattern_0_0_2_2 ) == 0;
    }
    return sum;
}

int main()
{
    clock_t volatile start;
    clock_t volatile stop;
    int volatile sum;
    printf( "Puzzling measurements:\n" );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = noIfs();
    stop = clock();
    printf( "Same without ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );

    start = clock();
    sum = unpredictableIfs();
    stop = clock();
    printf( "Unpredictable ifs took %d msec; answer was %d\n"
          , ms_elapsed(start, stop), sum );
}

Compiled with VS2010; /O2 optimizations Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1344 msec; answer was 500000000
Unpredictable ifs took 1016 msec; answer was 500000000
Same without ifs took 1031 msec; answer was 500000000
Unpredictable ifs took 4797 msec; answer was 500000000

Edit: Full switches of compiler:

/Zi /nologo /W3 /WX- /O2 /Oi /Oy- /GL /D "WIN32" /D "NDEBUG" /D "_CONSOLE" /D "_UNICODE" /D "UNICODE" /Gm- /EHsc /GS /Gy /fp:precise /Zc:wchar_t /Zc:forScope /Fp"Release\Trying.pch" /Fa"Release\" /Fo"Release\" /Fd"Release\vc100.pdb" /Gd /analyze- /errorReport:queue

Other person posted such ... Compiled with MinGW, g++ 4.71, -O1 optimizations Intel Core 2, WinXP results:

Puzzling measurements:
Unpredictable ifs took 1656 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000
Same without ifs took 1969 msec; answer was 500000000
Unpredictable ifs took 0 msec; answer was 500000000

Also he posted such results for -O3 optimizations:

Puzzling measurements:
Unpredictable ifs took 1890 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000
Same without ifs took 1422 msec; answer was 500000000
Unpredictable ifs took 2516 msec; answer was 500000000

Now I have question. What is going on here?

More specifically ... How can a fixed function take so different amounts of time? Is there something wrong in my code? Is there something tricky with Intel processor? Are the compilers doing something odd? Can it be because of 32 bit code ran on 64 bit processor?

Thanks for attention!

Edit: I accept that g++ -O1 just reuses returned values in 2 other calls. I also accept that g++ -O2 and g++ -O3 have defect that leaves the optimization out. Significant diversity of measured speeds (450% !!!) seems still mysterious.

I looked at disassembly of code produced by VS2010. It did inline unpredictableIfs 3 times. The inlined code was fairly similar; the loop was same. It did not inline noIfs. It did roll noIfs out a bit. It takes 4 steps in one iteration. noIfs calculate like was written while unpredictableIfs use jne to jump over increment.

4

3 回答 3

13

使用-O1,gcc-4.7.1unpredictableIfs只调用一次并重用结果,因为它识别出它是一个纯函数,所以每次调用的结果都是一样的。(我的做到了,通过查看生成的程序集进行了验证。)

在更高的优化级别下,函数是内联的,编译器不再识别它是相同的代码,所以每次在源代码中出现函数调用时都会运行它。

除此之外,我的 gcc-4.7.1unpredictableIfs在使用-O1or时处理-O2最好(除了重用问题,两者都产生相同的代码),而使用. 然而,相同代码的不同运行之间的时间在这里是一致的——相等或相差 10 毫秒(粒度为),所以我不知道是什么导致您报告的时间大不相同。noIfs-O3clockunpredictableIfs-O3

使用-O2时,for 循环unpredictableIfs与使用生成的代码相同-O1(寄存器交换除外):

.L12:
    movl    %eax, %ecx
    andl    $1073741826, %ecx
    cmpl    $1, %ecx
    adcl    $0, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne .L12

因为noIfs它是相似的:

.L15:
    xorl    %ecx, %ecx
    testl   $1073741826, %eax
    sete    %cl
    addl    $1, %eax
    addl    %ecx, %edx
    cmpl    $1000000000, %eax
    jne .L15

它在哪里

.L7:
    testl   $1073741826, %edx
    sete    %cl
    movzbl  %cl, %ecx
    addl    %ecx, %eax
    addl    $1, %edx
    cmpl    $1000000000, %edx
    jne .L7

-O1. 两个循环的运行时间相似,unpredictableIfs速度稍快一些。

随着-O3, for 循环unpredictableIfs变得更糟,

.L14:
    leal    1(%rdx), %ecx
    testl   $1073741826, %eax
    cmove   %ecx, %edx
    addl    $1, %eax
    cmpl    $1000000000, %eax
    jne     .L14

对于noIfs(包括此处的设置代码),它变得更好:

    pxor    %xmm2, %xmm2
    movq    %rax, 32(%rsp)
    movdqa  .LC3(%rip), %xmm6
    xorl    %eax, %eax
    movdqa  .LC2(%rip), %xmm1
    movdqa  %xmm2, %xmm3
    movdqa  .LC4(%rip), %xmm5
    movdqa  .LC5(%rip), %xmm4
    .p2align 4,,10
    .p2align 3
.L18:
    movdqa  %xmm1, %xmm0
    addl    $1, %eax
    paddd   %xmm6, %xmm1
    cmpl    $250000000, %eax
    pand    %xmm5, %xmm0
    pcmpeqd %xmm3, %xmm0
    pand    %xmm4, %xmm0
    paddd   %xmm0, %xmm2
    jne .L18

.LC2:
    .long   0
    .long   1
    .long   2
    .long   3
    .align 16
.LC3:
    .long   4
    .long   4
    .long   4
    .long   4
    .align 16
.LC4:
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .long   1073741826
    .align 16
.LC5:
    .long   1
    .long   1
    .long   1
    .long   1

它一次计算四次迭代,因此noIfs运行速度几乎是当时的四倍。

于 2013-02-19T19:25:40.983 回答
2

对了,看看64位Linux上gcc的汇编代码,第一种情况,加上-O1,UnpredictableIfs确实只调用了一次函数,结果被复用了。

使用 -O2 和 -O3 时,函数是内联的,所花费的时间应该相同。任何一段代码中也没有实际的分支,但是这两位代码的翻译有些不同,我已经删除了更新“sum”的行[%edx在这两种情况下]

不可预测的如果:

movl    %eax, %ecx
andl    $1073741826, %ecx
cmpl    $1, %ecx
adcl    $0, %edx
addl    $1, %eax

否:

xorl    %ecx, %ecx
testl   $1073741826, %eax
sete    %cl
addl    $1, %eax
addl    %ecx, %edx

正如你所看到的,它并不完全相同,但它做的事情非常相似。

于 2013-02-19T19:31:40.523 回答
2

关于 Windows 上的结果范围(从 1016 毫秒到 4797 毫秒):您应该知道clock()在 MSVC 中返回elapsed wall time。该标准说应该返回进程花费的 CPU 时间clock()的近似值,而其他实现在这方面做得更好。

鉴于 MSVC 提供了挂墙时间,如果您的进程在运行一次测试迭代时被抢占,即使代码在大约相同的 CPU 时间中运行,它也可能会产生更大的结果。

另请注意,clock()在许多 Windows PC 上,分辨率非常差,通常为 11-19 毫秒。您已经完成了足够多的迭代,只有大约 1%,所以我认为这不是差异的一部分,但是在尝试编写基准测试时要注意这一点是很好的。我知道您追求便携性,但如果您需要在 Windows 上进行更好的测量,您可以使用QueryPerformanceCounter它几乎肯定会为您提供更好的分辨率,尽管它仍然只是经过墙上的时间。

更新:在我了解到一个案例的长时间运行持续发生后,我启动了 VS2010 并重现了结果。对于某些运行,我通常会得到大约 1000 毫秒的时间,其他一些是 750 毫秒,而对于那些莫名其妙的运行,我通常会得到 5000 多毫秒。

观察:

  1. 在所有情况下,都内联了不可预测的Ifs() 代码。
  2. 删除 noIfs() 代码没有影响(因此长时间不是该代码的副作用)。
  3. 将线程关联设置为单个处理器没有效果。
  4. 5000 毫秒的时间总是后面的实例。我注意到后面的实例在循环开始 之前lea ecx,[ecx]有一个额外的指令: . 我不明白为什么会有 5 倍的差异。除此之外,早期和后来的实例是相同的代码。
  5. volatilestartstop变量中删除产生更少的长时间运行,更多的 750 ms 运行,并且没有 1000 ms 运行。(生成的循环代码现在在所有情况下看起来都完全相同,而不是leas。)
  6. volatile从变量中删除sum(但为时钟计时器保留它),长时间运行可以发生在任何位置。
  7. 如果删除所有volatile限定符,您将获得一致、快速(750 毫秒)的运行。(代码看起来与早期的相同,但edi被选择为sum而不是ecx。)

我不确定从这一切中可以得出什么结论,除非它volatile会对 MSVC 产生不可预测的性能影响,因此您应该仅在必要时应用它。

更新 2:我看到与使用 volatile 相关的一致的运行时差异,即使反汇编几乎相同。

具有挥发性:

Puzzling measurements:
Unpredictable ifs took 643 msec; answer was 500000000
Unpredictable ifs took 1248 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 4611 msec; answer was 500000000
Unpredictable ifs took 4706 msec; answer was 500000000
Unpredictable ifs took 4516 msec; answer was 500000000
Unpredictable ifs took 4382 msec; answer was 500000000

每个实例的反汇编如下所示:

    start = clock();
010D1015  mov         esi,dword ptr [__imp__clock (10D20A0h)]  
010D101B  add         esp,4  
010D101E  call        esi  
010D1020  mov         dword ptr [start],eax  
    sum = unpredictableIfs();
010D1023  xor         ecx,ecx  
010D1025  xor         eax,eax  
010D1027  test        eax,40000002h  
010D102C  jne         main+2Fh (10D102Fh)  
010D102E  inc         ecx  
010D102F  inc         eax  
010D1030  cmp         eax,3B9ACA00h  
010D1035  jl          main+27h (10D1027h)  
010D1037  mov         dword ptr [sum],ecx  
    stop = clock();
010D103A  call        esi  
010D103C  mov         dword ptr [stop],eax  

没有挥发物:

Puzzling measurements:
Unpredictable ifs took 644 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 624 msec; answer was 500000000
Unpredictable ifs took 605 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000
Unpredictable ifs took 599 msec; answer was 500000000

    start = clock();
00321014  mov         esi,dword ptr [__imp__clock (3220A0h)]  
0032101A  add         esp,4  
0032101D  call        esi  
0032101F  mov         dword ptr [start],eax  
    sum = unpredictableIfs();
00321022  xor         ebx,ebx  
00321024  xor         eax,eax  
00321026  test        eax,40000002h  
0032102B  jne         main+2Eh (32102Eh)  
0032102D  inc         ebx  
0032102E  inc         eax  
0032102F  cmp         eax,3B9ACA00h  
00321034  jl          main+26h (321026h)  
    stop = clock();
00321036  call        esi
// The only optimization I see is here, where eax isn't explicitly stored
// in stop but is instead immediately used to compute the value for the
// printf that follows.

除了寄存器选择之外,我没有看到显着的差异。

于 2013-02-20T13:56:49.653 回答