performance - Efficiency of OpenMP vs. optimisation levels

Question

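I am new to OpenMP, but this has puzzled me for the last few days and I could not find any answer online. Hopefully someone here can explain this strange phenomenon to me.

I wanted to compare the run times of a sequential and a parallel version of the same program. When I compile them (with gcc-10) using -O or higher, the parallel version runs much faster than the sequential one (~5x); the differences between the individual optimisation levels are small.

However, this is not the case when I compile both programs with -O0. In fact, with -O0 the sequential version is even slightly faster. I tried to find out whether some optimisation enabled only at -O1 and above was having a substantial effect, but with no luck.

For the record, compiling with -Os is better than -O0, but far less efficient than -O1 and above.

Has anyone noticed something similar? Is there an explanation for this?

Thanks!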

====

Here are the links to the C files: sequential code, parallel code

score 3 · Accepted Answer

The core of all your loops is something like:

var += something;

In the sequential code, each var is a local stack variable and with -O0 the line compiles to:

; Compute something and place it in RAX
ADD QWORD PTR [RBP-vvv], RAX

Here vvv is the offset of var in the stack frame rooted at the address stored in RBP.

With OpenMP, certain transformations of the source code take place and the same expression becomes:

*(omp_data->var) = *(omp_data->var) + something;

where omp_data is a pointer to a structure holding pointers to the shared variables used in the parallel region. This compiles to:

; Compute something and store it in RAX
MOV RDX, QWORD PTR [RBP-ooo]  ; Fetch omp_data pointer
MOV RDX, QWORD PTR [RDX]      ; Fetch *(omp_data->var)
ADD RDX, RAX
MOV RAX, QWORD PTR [RBP-ooo]  ; Fetch omp_data pointer
MOV QWORD PTR [RAX], RDX      ; Assign to *(omp_data->var)

This is the first reason the parallel code is slower - the simple action of incrementing var involves more memory accesses.

The second, and actually stronger, reason is false sharing. You have 8 shared accumulators: xa, xb, etc. Each is 8 bytes long, for a total of 64 bytes. Given how most compilers place such variables in memory, they most likely end up next to each other in the same cache line or in two adjacent cache lines (a cache line on x86-64 is 64 bytes long and is read and written as a single unit). When one thread writes to its accumulator, e.g., thread 0 updates xa, this invalidates that cache line in the caches of all other cores whose threads have accumulators in the same line, and they need to re-read the line from a shared cache level or even from main memory. This is bad. It is so bad that the slowdown it causes is far worse than having to access the accumulators through double pointer indirection.

What does -O1 change? It introduces register optimisation:

register r = *(omp_data->var);
for (a = ...) {
   r += something;
}
*(omp_data->var) = r;

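Despite var being a shared variable, OpenMP allows each thread to have a temporarily divergent view of memory. This allows the compiler to perform register optimisation: the running total lives in a register for the duration of the loop, and the copy of var in memory is written only once, after the loop ends. That removes both the extra memory accesses and the false sharing from the loop body.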

The solution is to simply make all xa, xb, etc. private.
