c++ - False sharing on multiple cores

Question

Will false sharing occur in the following program?

Memory

One array is divided into 4 equal regions: [A1, A2, B1, B2]
In the actual program, the whole array fits into the L1 cache.
Each region is padded to a multiple of 64 bytes.

Steps

1. Thread 1 writes to regions A1 and A2 while thread 2 writes to regions B1 and B2.
2. Barrier.
3. Thread 1 reads B1 and writes to A1 while thread 2 reads B2 and writes to A2.
4. Barrier.
5. Go to step 1.

Test

#include <vector>
#include <iostream>
#include <cstdint>
int main() {
    int N = 64;
    std::vector<std::int32_t> x(N, 0);
    #pragma omp parallel
    {
        for (int i = 0; i < 1000; ++i) {
            #pragma omp for
            for (int j = 0; j < 2; ++j) {
                for (int k = 0; k < (N / 2); ++k) {
                    x[j*N/2 + k] += 1;
                }
            }
            #pragma omp for
            for (int j = 0; j < 2; ++j) {
                for (int k = 0; k < (N/4); ++k) {
                    x[j*N/4 + k] += x[N/2 + j*N/4 + k] - 1;
                }
            }
        }
    }
    for (auto i : x ) std::cout << i << " ";
    std::cout << "\n";
}

Results

32 elements of 500500 (1000 * 1001 / 2)
32 elements of 1000

score 4 · Accepted Answer

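There is some false sharing in your code, because x is not guaranteed to be aligned to a cache line. Padding each region to a multiple of 64 bytes is not necessarily enough: if the array itself starts in the middle of a cache line, every region boundary also falls in the middle of a line, so two threads can end up writing to the same line. In your example N is really small, so this may be a problem. Note that at the N in your example, the biggest overhead is probably work sharing and thread management. If N is sufficiently large, i.e. array-size / number-of-threads >> cache-line-size, false sharing is not a relevant problem.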
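One way to get the alignment guarantee mentioned above is to give the vector an allocator that returns cache-line-aligned storage. The following is a minimal sketch, assuming C++17 and a 64-byte cache line; `kCacheLine`, `CacheAlignedAllocator`, `AlignedVector`, and `make_aligned` are hypothetical names introduced for illustration, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <new>
#include <vector>

// Assumption: 64-byte cache lines (common on x86-64, but not universal).
constexpr std::size_t kCacheLine = 64;

// Hypothetical minimal allocator whose storage starts on a cache-line boundary.
// Requires C++17 for the aligned forms of operator new / operator delete.
template <class T>
struct CacheAlignedAllocator {
    using value_type = T;

    CacheAlignedAllocator() = default;
    template <class U>
    CacheAlignedAllocator(const CacheAlignedAllocator<U>&) noexcept {}

    T* allocate(std::size_t n) {
        return static_cast<T*>(::operator new(
            n * sizeof(T), static_cast<std::align_val_t>(kCacheLine)));
    }
    void deallocate(T* p, std::size_t) noexcept {
        ::operator delete(p, static_cast<std::align_val_t>(kCacheLine));
    }
};

// All instances are interchangeable, so they always compare equal.
template <class T, class U>
bool operator==(const CacheAlignedAllocator<T>&,
                const CacheAlignedAllocator<U>&) noexcept { return true; }
template <class T, class U>
bool operator!=(const CacheAlignedAllocator<T>&,
                const CacheAlignedAllocator<U>&) noexcept { return false; }

using AlignedVector =
    std::vector<std::int32_t, CacheAlignedAllocator<std::int32_t>>;

// Builds a zero-filled vector and checks that its first element is aligned.
AlignedVector make_aligned(std::size_t n) {
    AlignedVector v(n, 0);
    assert(reinterpret_cast<std::uintptr_t>(v.data()) % kCacheLine == 0);
    return v;
}
```

With `AlignedVector x = make_aligned(N);` in place of the plain `std::vector`, a region whose size is a multiple of 64 bytes starts and ends on a line boundary, provided the 64-byte assumption holds on the target machine.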

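Alternating writes to A2 from different threads, as your code does (thread 1 in step 1, thread 2 in step 3), is also not optimal in terms of cache usage, but that is not a false sharing issue.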

Note that you do not need to split the loops. If you access memory at contiguous indices in a loop, a single loop is just fine, e.g.

#pragma omp for
for (int j = 0; j < N; ++j)
    x[j] += 1;

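If you want to be really careful, you can add schedule(static); then you have a guarantee of an even, contiguous distribution of the elements across threads.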
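As a rough illustration of the single-loop form with schedule(static), the sketch below increments every element and, separately, records which thread handled each index so the contiguous blocks can be inspected. The helper names `thread_index`, `increment_all`, and `owner_of_each_index` are hypothetical; the `_OPENMP` guard only exists so the code still compiles (serially) without OpenMP enabled.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Index of the calling thread; 0 when compiled without OpenMP.
inline int thread_index() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

// One loop over the whole array. With schedule(static) and no chunk size,
// each thread gets at most one contiguous block of roughly equal size.
void increment_all(std::vector<std::int32_t>& x) {
    const int n = static_cast<int>(x.size());
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < n; ++j) {
        x[j] += 1;
    }
}

// Diagnostic helper: owner[j] is the thread that executed iteration j.
std::vector<int> owner_of_each_index(int n) {
    assert(n >= 0);
    std::vector<int> owner(n, -1);
    #pragma omp parallel for schedule(static)
    for (int j = 0; j < n; ++j) {
        owner[j] = thread_index();
    }
    return owner;
}
```

Printing the result of `owner_of_each_index(64)` on a given machine shows where the block boundaries fall, which makes it easy to check whether they coincide with cache-line boundaries of the array.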

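Remember that false sharing is a performance issue, not a correctness problem, and it is only relevant if it occurs frequently. A typical bad pattern is writing to vector[my_thread_index].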
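The bad pattern named above, and one common way around it, might look like the following sketch. It assumes a 64-byte cache line and C++17 (so that `std::vector` honors the over-aligned element type); `count_shared`, `count_padded`, `PaddedCounter`, and the small helpers are hypothetical names. Both functions return the same total, which reflects the point that false sharing affects speed, not results.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>
#ifdef _OPENMP
#include <omp.h>
#endif

// Assumption: 64-byte cache lines.
constexpr std::size_t kCacheLine = 64;

inline int thread_index() {
#ifdef _OPENMP
    return omp_get_thread_num();
#else
    return 0;
#endif
}

inline int max_threads() {
#ifdef _OPENMP
    return omp_get_max_threads();
#else
    return 1;
#endif
}

// Bad pattern: per-thread counters sit next to each other, so several of them
// share one cache line and every write can invalidate a neighbour's line.
std::int64_t count_shared(std::int64_t iters) {
    assert(iters >= 0);
    std::vector<std::int64_t> counts(max_threads(), 0);
    #pragma omp parallel for schedule(static)
    for (std::int64_t i = 0; i < iters; ++i) {
        counts[thread_index()] += 1;
    }
    std::int64_t total = 0;
    for (std::int64_t c : counts) total += c;
    return total;
}

// Each counter is aligned and padded to occupy a full cache line of its own.
struct alignas(kCacheLine) PaddedCounter {
    std::int64_t value = 0;
};
static_assert(sizeof(PaddedCounter) == kCacheLine, "one counter per line");

std::int64_t count_padded(std::int64_t iters) {
    assert(iters >= 0);
    std::vector<PaddedCounter> counts(max_threads());
    #pragma omp parallel for schedule(static)
    for (std::int64_t i = 0; i < iters; ++i) {
        counts[thread_index()].value += 1;
    }
    std::int64_t total = 0;
    for (const PaddedCounter& c : counts) total += c.value;
    return total;
}
```

In many cases a thread-local accumulator or an OpenMP `reduction(+:total)` clause avoids the shared array altogether; padding is shown here only because it maps directly onto the vector[my_thread_index] pattern.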
