The core of all your loops is something like:
var += something;
In the sequential code, each var
is a local stack variable and with -O0
the line compiles to:
; Compute something and place it in RAX
ADD QWORD PTR [RBP-vvv], RAX
Here vvv
is the offset of var
in the stack frame rooted at the address stored in RBP
.
With OpenMP, certain transformations of the source code take place and the same expression becomes:
*(omp_data->var) = *(omp_data->var) + something;
where omp_data
is a pointer to a structure holding pointers to the shared variables used in the parallel region. This compiles to:
; Compute something and store it in RAX
MOV RDX, QWORD PTR [RBP-ooo] ; Fetch omp_data pointer
MOV RDX, QWORD PTR [RDX] ; Fetch *(omp_data->var)
ADD RDX, RAX
MOV RAX, QWORD PTR [RBP-ooo] ; Fetch omp_data pointer
MOV QWORD PTR [RAX], RDX ; Assign to *(omp_data->var)
This is the first reason the parallel code is slower - the simple action of incrementing var
involves more memory accesses.
The second, and actually stronger reason is the false sharing. You have 8 shared accumulators: xa
, xb
, etc. Each is 8 bytes long and aligned in memory for a total of 64 bytes. Given how most compilers place such variables in memory, they most likely end up next to each other in the same cache line or in two cache lines (a cache line on x86-64 is 64 bytes long and is read and written as a single unit). When one thread writes to its accumulator, e.g., thread 0 updates xa
, this invalidates the cache of all other threads whose accumulators happen to be in the same cache line and they need to re-read the value from an upper level cache or even the main memory. This is bad. This is so bad, that the slowdown it causes is way worse than having to access the accumulators through double pointer indirection.
What does -O1
change? It introduces register optimisation:
register r = *(omp_data->var);
for (a = ...) {
r += something;
}
*(omp_data->var) = r;
Despite var
being a shared variable, OpenMP allows for temporarily divergent memory views in each thread. This allows the compiler to perform register optimisation, in which the value of var
does not change for the duration of the loop.
The solution is to simply make all xa
, xb
, etc. private.