c++ - C ++：将一个操作数保存在寄存器中带来了惊人的加速

Question

我一直在尝试通过使用以下代码对数组元素进行缩放和求和的例程定时来了解在 L1 缓存中拥有数组与内存的影响（我知道我应该只通过'来缩放结果a' 在最后；关键是在循环中同时进行乘法和加法 - 到目前为止，编译器还没有想出分解'a'）：

double sum(double a,double* X,int size)
{
    double total = 0.0;
    for(int i = 0;  i < size; ++i)
    {
        total += a*X[i];
    }
    return total;
}

#define KB 1024
int main()
{
    //Approximately half the L1 cache size of my machine
    int operand_size = (32*KB)/(sizeof(double)*2);
    printf("Operand size: %d\n", operand_size);
    double* X = new double[operand_size];
    fill(X,operand_size);

    double seconds = timer();
    double result;
    int n_iterations = 100000;
    for(int i = 0; i < n_iterations; ++i)
    {
        result = sum(3.5,X,operand_size);
        //result += rand();  
    }
    seconds = timer() - seconds; 

    double mflops = 2e-6*double(n_iterations*operand_size)/seconds;
    printf("Vector size %d: mflops=%.1f, result=%.1f\n",operand_size,mflops,result);
    return 0;
}

请注意，为简洁起见，不包括 timer() 和 fill() 例程；如果您想运行代码，可以在此处找到它们的完整源代码：

http://codepad.org/agPWItZS

现在，这就是有趣的地方。这是输出：

Operand size: 2048
Vector size 2048: mflops=588.8, result=-67.8

这完全是未缓存的性能，尽管 X 的所有元素都应该在循环迭代之间保存在缓存中。查看由以下生成的汇编代码：

g++ -O3 -S -fno-asynchronous-unwind-tables register_opt_example.cpp

我注意到 sum 函数循环中的一个奇怪之处：

L55:
    movsd   (%r12,%rax,8), %xmm0
    mulsd   %xmm1, %xmm0
    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)
    incq    %rax
    cmpq    $2048, %rax
    jne L55

说明：

    addsd   -72(%rbp), %xmm0
    movsd   %xmm0, -72(%rbp)

表示它将 sum() 中“total”的值存储在堆栈上，并在每次循环迭代时读取和写入它。我修改了程序集，使这个操作数保存在一个寄存器中：

...
addsd   %xmm0, %xmm3
...

这个微小的变化创造了巨大的性能提升：

Operand size: 2048
Vector size 2048: mflops=1958.9, result=-67.8

tl; dr 我的问题是：为什么用寄存器替换单个内存位置访问会大大加快代码速度，因为单个位置应该存储在 L1 缓存中？哪些架构因素使这成为可能？重复写入一个堆栈位置会完全破坏缓存的有效性，这似乎很奇怪。

附录

我的 gcc 版本是：

Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646.1~2/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)

我的CPU是：

英特尔至强 X5650

score 61 · Accepted Answer

它可能是较长的依赖链以及负载错误预测*的组合。

更长的依赖链：

首先，我们确定关键的依赖路径。然后我们看看提供的指令延迟：http ://www.agner.org/optimize/instruction_tables.pdf （第117页）

在未优化版本中，关键依赖路径为：

addsd -72(%rbp), %xmm0
movsd %xmm0, -72(%rbp)

在内部，它可能分解为：

负载（2 个周期）
添加（3 个周期）
存储（3 个周期）

如果我们查看优化版本，它只是：

添加（3 个周期）

所以你有 8 个周期与 3 个周期。几乎是 3 倍。

我不确定 Nehalem 处理器系列对存储负载依赖项的敏感程度以及它在转发方面的表现如何。但有理由相信它不为零。

加载存储错误预测：

现代处理器以您可以想象的更多方式使用预测。其中最著名的可能是分支预测。鲜为人知的一种是负载预测。

当处理器看到负载时，它将在所有挂起的写入完成之前立即加载它。它将假定这些写入不会与加载的值冲突。

如果较早的写入结果与加载冲突，则必须重新执行加载并将计算回滚到加载点。（与分支错误预测回滚的方式大致相同）

它在这里是如何相关的：

不用说，现代处理器将能够同时执行该循环的多次迭代。因此，处理器将尝试执行加载（在它完成上一次迭代addsd -72(%rbp), %xmm0)的存储（）之前）。movsd %xmm0, -72(%rbp)

结果？先前的存储与负载冲突 - 因此是错误预测和回滚。

_{*请注意，我不确定“负载预测”的名称。我只在英特尔文档中读到过它，他们似乎没有给它起名字。}

score 16 · Accepted Answer

我推测问题不在于缓存/内存访问，而在于处理器（执行代码）。这里有几个可见的瓶颈。

这里的性能数字基于我使用的盒子（sandybridge 或 westmere）

标量数学的峰值性能是 2.7Ghz x2 FLOPS/Clock x2，因为处理器可以同时进行加法和乘法运算。代码的理论效率为0.6/(2.7*2) = 11%

所需带宽：每 (+) 和 (x) 2 个双倍 -> 4bytes/Flop 4 bytes * 5.4GFLOPS = 21.6GB/s

如果您知道它最近被读取，它可能在 L1 (89GB/s)、L2 (42GB/s) 或 L3(24GB/s) 中，因此我们可以排除缓存 B/W

内存子系统为 18.9 GB/s，因此即使在主内存中，峰值性能也应接近 18.9/21.6GB/s = 87.5 %

可能希望尽早批处理请求（通过展开）

即使是推测性执行，tot += a *X[i] 添加也会被序列化，因为 tot(n) 需要在 tot(n+1) 可以启动之前进行评估

第一个展开循环
将 i 移动 8 位并执行

{//your func
    for( int i = 0; i < size; i += 8 ){
        tot += a * X[i];
        tot += a * X[i+1];
        ...
        tot += a * X[i+7];
    }
    return tot
}

使用多个累加器
这将打破依赖关系并允许我们避免在加法管道上停滞

{//your func//
    int tot,tot2,tot3,tot4;
    tot = tot2 = tot3 = tot4 = 0
    for( int i = 0; i < size; i += 8 ) 
        tot  += a * X[i];
        tot2 += a * X[i+1];
        tot3 += a * X[i+2];
        tot4 += a * X[i+3];
        tot  += a * X[i+4];
        tot2 += a * X[i+5];
        tot3 += a * X[i+6];
        tot4 += a * X[i+7];
    }
    return tot + tot2 + tot3 + tot4;
}

更新在 SandyBridge 盒子上运行后，我可以访问：（2.7GHZ SandyBridge with -O2 -march=native -mtune=native

原始代码：

Operand size: 2048  
Vector size 2048: mflops=2206.2, result=61.8  
2.206 / 5.4 = 40.8%

改进的代码：

Operand size: 2048  
Vector size 2048: mflops=5313.7, result=61.8  
5.3137 / 5.4 = 98.4%

score 8 · Accepted Answer

我实际上无法重现这一点，因为我的编译器（gcc 4.7.2）保存total在一个寄存器中。

我怀疑缓慢的主要原因与 L1 缓存无关，而是由于存储在

movsd   %xmm0, -72(%rbp)

以及后续迭代的负载：

addsd   -72(%rbp), %xmm0

c++ - C ++：将一个操作数保存在寄存器中带来了惊人的加速

3 回答 3

Related

Reference