c++ - Inconsistent timings between cache flushing routines

Question

I need to time a DAXPY linear algebra kernel. Naively, I would try something like this:

fill(X,operand_size);
fill(Y,operand_size);
double seconds = timer();
daxpy(alpha,X,Y,operand_size);
seconds = timer() - seconds;

The full code is linked at the end, if needed.

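The problem is that the memory accesses made while filling the operands X and Y leave them in the processor cache. As a result, the subsequent memory accesses in the DAXPY call are much faster than they would be in a production run.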

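I compared two ways of getting around this. The first is to flush the operands from all levels of cache with the clflush instruction. The second is to read a very large array until the operand entries are "naturally" evicted from the cache. I tested both; here are the run times for a single DAXPY call with operands of size 2048: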

Without flush: time=2840 ns
With clflush: time=4090 ns
With copy flush: time=5919 ns

Here is another run, made a few seconds later:

Without flush: time=2717 ns
With clflush: time=4121 ns
With copy flush: time=4796 ns
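For readers without access to the linked code, the two flushing approaches compared above could look roughly like the sketch below. This is not the code from the question: the names flush_with_clflush, flush_by_sweep and kCacheLine are hypothetical, the 64-byte line size is an assumption, and the clflush path is only compiled when SSE2 is available (otherwise it does nothing).

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

#if defined(__SSE2__)
#include <emmintrin.h>  // _mm_clflush, _mm_mfence
#endif

// Assumed cache line size in bytes; typical for x86, but not guaranteed.
constexpr std::size_t kCacheLine = 64;

// Method 1: evict every cache line overlapping [p, p + bytes) with clflush.
inline void flush_with_clflush(const void* p, std::size_t bytes) {
    if (bytes == 0) return;
#if defined(__SSE2__)
    // Round the start address down to a cache line boundary.
    const std::uintptr_t begin = reinterpret_cast<std::uintptr_t>(p) &
                                 ~static_cast<std::uintptr_t>(kCacheLine - 1);
    const std::uintptr_t end = reinterpret_cast<std::uintptr_t>(p) + bytes;
    for (std::uintptr_t a = begin; a < end; a += kCacheLine)
        _mm_clflush(reinterpret_cast<const void*>(a));
    _mm_mfence();  // wait for the flushes to complete before timing starts
#else
    (void)p;       // no clflush on this target: the sketch degrades to a no-op
#endif
}

// Method 2: touch one byte per cache line of a large scratch buffer so that
// older lines (including the operands) are evicted. The buffer is assumed to
// be larger than the last-level cache. Reads go through a volatile pointer so
// the compiler cannot drop them; the sum is returned only as a sanity value.
inline std::uint64_t flush_by_sweep(const std::vector<unsigned char>& scratch) {
    const volatile unsigned char* p = scratch.data();
    std::uint64_t sum = 0;
    for (std::size_t i = 0; i < scratch.size(); i += kCacheLine) sum += p[i];
    return sum;
}
```

Note that this sweep only reads. A flush that copies one large array into another also writes, which leaves modified lines in the cache; that difference matters for the explanation in the answer.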

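As expected, flushing increases the run time. However, I don't understand how the copy flush can make the DAXPY routine take substantially longer than clflush does. The clflush instruction should evict the operands from ALL caches, so the time with clflush should be an upper bound on the run time after any other cache flushing procedure. On top of that, the timings with flushing (for both methods) fluctuate a lot: by hundreds of nanoseconds, versus less than 10 nanoseconds for the unflushed case. Does anyone know why manual flushing produces such dramatic differences in run time?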
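Since single measurements of a few microseconds fluctuate this much, one way to make the three cases comparable is to repeat each measurement and report the minimum, median and maximum instead of one sample. The sketch below is an illustration only, not the timing code from the question: it assumes a C++11 compiler (newer than the gcc 4.2.1 used here), and TimingSummary, summarize and time_daxpy are hypothetical names.

```cpp
#include <algorithm>
#include <chrono>
#include <cstddef>
#include <utility>
#include <vector>

// Plain DAXPY: y[i] += alpha * x[i].
inline void daxpy(double alpha, const double* x, double* y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) y[i] += alpha * x[i];
}

struct TimingSummary { double min_ns; double median_ns; double max_ns; };

// Reduce a non-empty list of samples to min / median / max.
inline TimingSummary summarize(std::vector<double> samples) {
    std::sort(samples.begin(), samples.end());
    const std::size_t n = samples.size();
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);
    return {samples.front(), median, samples.back()};
}

// Time `trials` DAXPY calls (trials >= 1). `prepare` runs before each timed
// call and is where a flush routine (or nothing) would go. The checksum of y
// is written out so the compiler has a reason to keep the DAXPY work.
template <class Prepare>
TimingSummary time_daxpy(std::size_t n, int trials, Prepare prepare,
                         double& checksum) {
    std::vector<double> x(n, 1.0), y(n, 2.0);
    std::vector<double> samples;
    samples.reserve(static_cast<std::size_t>(trials));
    for (int t = 0; t < trials; ++t) {
        prepare(x.data(), y.data(), n);
        const auto t0 = std::chrono::steady_clock::now();
        daxpy(0.5, x.data(), y.data(), n);
        const auto t1 = std::chrono::steady_clock::now();
        samples.push_back(
            std::chrono::duration<double, std::nano>(t1 - t0).count());
    }
    checksum = 0.0;
    for (double v : y) checksum += v;
    return summarize(std::move(samples));
}
```

Passing an empty lambda as prepare gives the unflushed case; passing a lambda that flushes x and y gives the flushed cases. The clock resolution of std::chrono::steady_clock is platform dependent, so very short kernels may need larger operands or a finer timer.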

Appendix

The full code, with all the timing and flushing routines, is here (194 lines):

http://codepad.org/hNJpQxTv

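Here is my gcc version. The code was compiled with the -O3 option. (I know, it's old; some software I have to build isn't compatible with newer gcc.)

Using built-in specs.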

Target: i686-apple-darwin10
Configured with: /var/tmp/gcc/gcc-5646.1~2/src/configure --disable-checking --enable-werror --prefix=/usr --mandir=/share/man --enable-languages=c,objc,c++,obj-c++ --program-transform-name=/^[cg][^.-]*$/s/$/-4.2/ --with-slibdir=/usr/lib --build=i686-apple-darwin10 --with-gxx-include-dir=/include/c++/4.2.1 --program-prefix=i686-apple-darwin10- --host=x86_64-apple-darwin10 --target=i686-apple-darwin10
Thread model: posix
gcc version 4.2.1 (Apple Inc. build 5646) (dot 1)

I am on Mac OS X 10.6 with an Intel Xeon 5650 processor.

score 2 · Accepted Answer

Pseudocode for (part of) what the CPU actually does for a normal read (which happens a lot while you are benchmarking) might be:

if( cache line is not in cache ) {
    if(cache is full) {
        find cache line to evict
        if( cache line to evict is in "modified" state ) {
            write cache line to evict to slower RAM
        }
        set cache line to "free"
    }
    fetch the cache line from slower RAM 
}
temp = extract whatever we're reading from the cache line

If you use CLFLUSH to flush the cache, then if(cache is full) will be false, because CLFLUSH leaves the cache empty.

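If you use copying to flush the cache, then the if(cache is full) branch will be true, and if( cache line to evict is in "modified" state ) will also be true half the time (half of the cache will hold data you read during the copy, and the other half will hold data you wrote during the copy). This means that half the time you end up doing write cache line to evict to slower RAM.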

Doing write cache line to evict to slower RAM consumes RAM chip bandwidth and interferes with fetch the cache line from slower RAM.
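The argument can be made concrete with a toy model of the pseudocode above. This is only a counting exercise, not a model of any real CPU: ToyCache and simulate_reads are hypothetical names, the cache is treated as fully associative with FIFO eviction (an assumption), and the copy is assumed to interleave clean source lines with modified destination lines.

```cpp
#include <cstddef>
#include <cstdint>
#include <deque>

// Toy cache following the read pseudocode: fully associative, FIFO eviction.
struct ToyCache {
    struct Line { std::uint64_t tag; bool dirty; };

    std::size_t capacity;        // number of lines; must be at least 1
    std::deque<Line> lines;      // front = oldest line, evicted first
    std::size_t fetches = 0;     // "fetch the cache line from slower RAM"
    std::size_t writebacks = 0;  // "write cache line to evict to slower RAM"

    explicit ToyCache(std::size_t cap) : capacity(cap) {}

    void access(std::uint64_t tag, bool is_write) {
        for (Line& l : lines) {
            if (l.tag == tag) {              // hit: no RAM traffic
                l.dirty = l.dirty || is_write;
                return;
            }
        }
        if (lines.size() == capacity) {      // cache is full: evict the oldest
            if (lines.front().dirty) ++writebacks;
            lines.pop_front();
        }
        ++fetches;
        lines.push_back({tag, is_write});
    }
};

// Read n_lines operand lines (tags 0..n_lines-1, assumed below 1000000) from
// a cache that is either empty (the CLFLUSH case as described in the answer)
// or filled by a copy that alternates clean and modified lines.
inline ToyCache simulate_reads(std::size_t capacity, std::size_t n_lines,
                               bool prefill_with_copy) {
    ToyCache c(capacity);
    if (prefill_with_copy) {
        for (std::size_t i = 0; i < capacity; ++i)
            c.access(1000000 + i, /*is_write=*/i % 2 == 1);
    }
    c.fetches = 0;      // count only the traffic caused by the operand reads
    c.writebacks = 0;
    for (std::size_t i = 0; i < n_lines; ++i) c.access(i, false);
    return c;
}
```

With a capacity of 8 lines and 4 operand lines, the empty cache performs 4 fetches and no write-backs, while the copy-filled cache performs the same 4 fetches plus 2 write-backs. That extra RAM traffic is the effect the answer attributes the slower copy-flush timings to.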
