c++ - How to understand the tricky speed up

Question

Sorry for may be too abstract question, but for me it is quite practical + may be some experts had similar experience and can explain it.

I have a big code, about 10000 lines size.

I notices that if in a certain place I put

if ( expression ) continue;

where expression is always false (double checked with logic of code and cout), but depends on unknown parameters (so compiler can't simply rid of this line during compilation) the speed of the program is increased by 25% (the result of calculation are the same). If I measure speed of the loop itself the speed up factor is bigger than 3.

Why can this happen and what is possible ways to use this speed up possibility without such tricks?

P.S. I use gcc 4.7.3, -O3 optimisation.

More info:

I have tried two different expressions, both works.

If I change the line to:

if ( expression ) { cout << " HELLO " << endl; continue; };

the speed up is gone.

If I change the line to:
```
expression;
```
the speed up is gone.

The code, which surrounds the line looks like this:

for ( int i = a; ;  ) {
  do {
    i += d;
    if ( d*i > d*ilast ) break;

      // small amount of calculations, and conditional calls of continue;

  } while ( expression0 );
  if ( d*i > dir*ilast ) break;

  if ( expression ) continue;

   // very big amount calculations, and conditional calls of continue;

}

the for loop looks strange. It is because I have modified the loops in order to catch this bottle neck. Initially expression was equal to expression0 and instead of do-loop I had only this continue.

I tried use __builtin_expect in order to understand branch prediction. With

  // the expression (= false) is supposed to be true by branch prediction.
if ( __builtin_expect( !!(expression), 1) ) continue;

the speed up is 25%.

  // the expression (= false) is supposed to be false by branch prediction.
if ( __builtin_expect( !!(expression), 0) ) continue;

the speed up is gone.

If I use -O2 instead of -O3 the effect is gone. The code is slightly (~3%) slower than the fast O3-version with the false condition.
Same for "-O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -ftree-vectorize". With one more option: "-O2 -finline-functions -funswitch-loops -fpredictive-commoning -fgcse-after-reload -ftree-vectorize -fipa-cp-clone" the effect is amplified. With "the line" the speed is same, without "the line" the code is 75% slower.

The reason is in just following conditional operator. So the code looks like this:

for ( int i = a; ;  ) {

      // small amount of calculations, and conditional calls of continue;

  if ( expression ) continue;

    // calculations1

  if ( expression2 ) {
    // calculations2
  }

   // very big amount calculations, and conditional calls of continue;

}

The value of expression2 is almost always false. So I changed it like this:

for ( int i = a; ;  ) {

      // small amount of calculations, and conditional calls of continue;

  // if ( expression ) continue; // don't need this anymore

    // calculations1

  if ( __builtin_expect( !!(expression2), 0 ) ) { // suppose expression2 == false
    // calculations2
  }

   // very big amount calculations, and conditional calls of continue;

}

And have got desired 25% speed up. Even a little bit more. And behaviour no longer depends on the critical line.

If somebody knows materials, which can explain this behaviour without guesses I will be very glad to read and accept their answer.

score 3 · Accepted Answer

找到了。

原因在于下面的条件运算符。所以代码看起来像这样：

for ( int i = a; ;  ) {

      // small amount of calculations, and conditional calls of continue;

  if ( expression ) continue;

    // calculations1

  if ( expression2 ) {
    // calculations2
  }

   // very big amount calculations, and conditional calls of continue;

}

expression2 的值几乎总是假的。所以我把它改成这样：

for ( int i = a; ;  ) {

      // small amount of calculations, and conditional calls of continue;

  // if ( expression ) continue; // don't need this anymore

    // calculations1

  if ( __builtin_expect( !!(expression2), 0 ) ) { // suppose expression2 == false
    // calculations2
  }

   // very big amount calculations, and conditional calls of continue;

}

并获得了理想的 25% 加速。甚至更多一点。并且行为不再取决于临界线。

我不确定如何解释它，也找不到足够的分支预测材料。

但我想重点是应该跳过计算2，但编译器不知道这一点，默认情况下假设表达式2 == true。同时假设在简单的继续检查中

if ( expression ) continue;

表达式 == false，并且很好地跳过了计算2，因为在任何情况下都必须这样做。如果我们有更复杂的操作（例如 cout），则假设表达式为真，并且该技巧不起作用。

如果有人知道材料，可以不用猜测就能解释这种行为，我会很高兴阅读并接受他们的回答。

score 0 · Accepted Answer

我不想这么说，但答案将是非常技术性的，更重要的是，非常具体到您的代码。如此之多，以至于除了您自己之外，可能没有人会花时间来调查您问题的根源。正如其他人所建议的那样，它很可能取决于分支预测和其他与流水线相关的编译后优化。

如果这是编译器优化问题或编译后 (CPU) 优化，我唯一可以建议帮助您缩小范围的方法是使用-O2vs再次编译您的代码-O3，但这次添加以下附加选项： -fverbose-asm -S. 将每个输出通过管道传输到两个不同的文件，然后运行 sdiff 之类的东西来比较它们。你应该看到很多不同。

不幸的是，如果对汇编代码没有很好的理解，就很难对它做出正面或反面，老实说，没有多少人在 Stack Overflow 上有耐心（或时间）在这个问题上花费超过几分钟的时间。如果您不精通汇编（大概是 x86），那么我建议您找一位精通的同事或朋友来帮助您解析汇编输出。

score 0 · Accepted Answer

那个不可能到达的分支的引入打破了流程图。通常编译器知道执行流程是从循环的顶部直接到退出测试并再次回到开始。现在图中有一个额外的节点，流可以离开循环。它现在需要以不同的方式编译循环体，分为两部分。

这几乎总是导致更糟糕的代码。为什么它不在这里，我只能提供一个猜测：您没有使用分析信息进行编译。因此，编译器必须做出假设。特别是，它必须对分支将在运行时执行的可能性做出假设。

显然，由于它必须做出的假设不同，因此生成的代码很可能在速度上有所不同。

c++ - How to understand the tricky speed up

3 回答 3

Related

Reference