performance - Strange behavior of the x86 "cmp" instruction

Question

Here is the code:

#include <iostream>
#include <cstdlib>  // rand, srand
#include <ctime>    // clock, time

using namespace std;

#define ARR_LENGTH 1000000
#define TEST_NUM 0
typedef unsigned int uint;

uint arr[ARR_LENGTH];

uint inc_time(uint x) {
    uint y = 0, tm = clock();
    for (uint i = 0; i < x; i++) y++;
    return clock() - tm;
}

int main() {
    uint div = 0, mod = 0, tm = 0, overall = 0, inc_tm;
    srand(time(NULL));
    for (uint i = 0; i < ARR_LENGTH; i++) arr[i] = (uint)rand() + 2;

    tm = clock();
    for (uint i = 0; i < ARR_LENGTH - 1; i++)
        if (arr[i] % arr[i+1] != TEST_NUM) mod++;
    overall = clock() - tm;
    inc_tm = inc_time(mod);
    cout << "mods - " << mod << endl;
    cout << "Overall time - " << overall << endl;
    cout << "   wasted on increment - " << inc_tm << endl;
    cout << "   wasted on condition - " << overall - inc_tm << endl << endl;

    tm = clock();
    for (uint i = 0; i < ARR_LENGTH - 1; i++)
        if (arr[i]/arr[i+1] != TEST_NUM) div++;
    overall = clock() - tm;
    inc_tm = inc_time(div);
    cout << "divs - " << div << endl;
    cout << "Overall time - " << overall << endl;
    cout << "   wasted on increment - " << inc_tm << endl;
    cout << "   wasted on condition - " << overall - inc_tm << endl << endl;

    return 0;
}

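If you use Visual Studio, just compile in DEBUG (not RELEASE) mode; if you use GCC, disable dead code elimination (-fno-dce). Otherwise some parts of the code will be optimized away and the test won't work.

So the question is: when you set the TEST_NUM constant to a non-zero value (for example 5), both conditions (modulus and division) take roughly the same time, but when you set TEST_NUM to 0, the second condition runs slower (up to 3 times!). Why?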

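Here is the disassembly listing (image): http://img213.imageshack.us/slideshow/webplayer.php?id=wp000076.jpg

In the 0 case, a test instruction is used instead of cmp X, 0. But even if you patch cmp X, 5 (in the 5 case) to cmp X, 0, you will see that it does not affect the modulus operation, while it does affect the division operation.

Watch carefully how the operation counts and times change when you change the TEST_NUM constant.

If anyone can, please explain how this happens.
Thanks.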

score 6 · Accepted Answer

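In the case TEST_NUM == 0, the first condition almost always has the same outcome: the remainder is rarely exactly 0, so arr[i] % arr[i+1] != TEST_NUM is almost always true. Branch prediction will recognize this and predict the same outcome every time. That prediction is correct in most cases, so the expensive mispredicted branch rarely has to be paid for.

The case TEST_NUM == 5 is almost the same: the remainder is rarely exactly 5, so the first condition is again highly predictable.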

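For the second condition and TEST_NUM == 0, the result of the division is zero whenever arr[i] < arr[i+1], which happens with a probability of about 0.5. This is the worst case for the branch predictor: on average, every second branch is mispredicted. Each iteration therefore costs, on average, half the clock cycles of a mispredicted branch (depending on the architecture, the full penalty may be between 10 and 20 cycles).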
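To make the probabilities above concrete, here is a minimal sketch that counts how often each condition is true. It is not part of the original program: the helper names true_fraction and make_data are hypothetical, and a fixed-seed std::mt19937 with the range 0..32767 (the RAND_MAX of Visual Studio) is assumed in place of rand() so that the run is repeatable.

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Hypothetical helper (not in the original program): fills a vector the way
// the question does (random value + 2). The fixed seed and the 0..32767
// range are assumptions made for repeatability.
std::vector<unsigned> make_data(std::size_t n, unsigned seed) {
    std::mt19937 gen(seed);
    std::uniform_int_distribution<unsigned> dist(0, 32767);
    std::vector<unsigned> arr(n);
    for (auto& x : arr) x = dist(gen) + 2;  // +2 keeps every divisor non-zero
    return arr;
}

// Hypothetical helper: fraction of adjacent pairs for which the condition
// is true. use_div selects arr[i] / arr[i+1] != test_num, otherwise
// arr[i] % arr[i+1] != test_num is evaluated.
double true_fraction(const std::vector<unsigned>& arr, unsigned test_num,
                     bool use_div) {
    assert(arr.size() >= 2);
    std::size_t hits = 0;
    for (std::size_t i = 0; i + 1 < arr.size(); i++) {
        unsigned v = use_div ? arr[i] / arr[i + 1] : arr[i] % arr[i + 1];
        if (v != test_num) hits++;
    }
    return static_cast<double>(hits) / static_cast<double>(arr.size() - 1);
}
```

With a million elements, true_fraction(arr, 0, false) should come out very close to 1, while true_fraction(arr, 0, true) should come out near 0.5, which is the coin-flip case described above.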

With TEST_NUM == 5, the outcome of the second condition is lopsided again: the quotient rarely equals 5 (the probability is about 0.1, not quite sure here), so the condition is almost always true. This is much more predictable. Typically the predictor will predict (almost) always the same outcome, with a few random exceptions in between, but that depends on the innards of the processor. In any case, you pay the additional cycles for a mispredicted branch far less often, at worst in every fifth case.
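The effect can be illustrated with a toy model. The sketch below feeds the outcomes of each condition into a textbook 2-bit saturating counter and reports the misprediction rate. This is only an assumption-laden illustration: real processors use far more elaborate predictors, the function names mispredict_rate and branch_outcomes are hypothetical, and a fixed-seed std::mt19937 with the range 0..32767 stands in for rand().

```cpp
#include <cassert>
#include <cstddef>
#include <random>
#include <vector>

// Textbook 2-bit saturating counter: states 0 and 1 predict "false",
// states 2 and 3 predict "true". A simplified model, not a description
// of any particular processor.
double mispredict_rate(const std::vector<bool>& outcomes) {
    assert(!outcomes.empty());
    int state = 0;
    std::size_t misses = 0;
    for (bool actual : outcomes) {
        bool predicted = state >= 2;
        if (predicted != actual) misses++;
        if (actual && state < 3) state++;   // move toward "true"
        if (!actual && state > 0) state--;  // move toward "false"
    }
    return static_cast<double>(misses) / static_cast<double>(outcomes.size());
}

// Hypothetical helper: records the outcome of the question's condition for
// every adjacent pair. The seed and the 0..32767 range are assumptions.
std::vector<bool> branch_outcomes(unsigned test_num, bool use_div,
                                  unsigned seed) {
    const std::size_t n = 1000000;
    std::mt19937 gen(seed);
    std::uniform_int_distribution<unsigned> dist(0, 32767);
    std::vector<unsigned> arr(n);
    for (auto& x : arr) x = dist(gen) + 2;  // same "+ 2" as the question

    std::vector<bool> out;
    out.reserve(n - 1);
    for (std::size_t i = 0; i + 1 < n; i++) {
        unsigned v = use_div ? arr[i] / arr[i + 1] : arr[i] % arr[i + 1];
        out.push_back(v != test_num);
    }
    return out;
}
```

In this model, the modulus condition (for either TEST_NUM) and the division condition with TEST_NUM == 5 are mispredicted only rarely, whereas the division condition with TEST_NUM == 0 is mispredicted on a large share of the iterations. The exact rates on real hardware will differ.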
