
Edit: my first code example was wrong; I replaced it with a simpler one.

I implemented a C++ library for algebraic operations between large vectors and matrices. I found that on x86-64 CPUs, OpenMP-parallelized vector addition, dot products, and the like are no faster than single-threaded code: the parallel versions run anywhere from 1% slower to 6% faster. I believe this is because of the memory bandwidth limit.
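For reference, the kind of kernel I am benchmarking looks roughly like this (a simplified sketch, not my actual library code; a, b, c, and n are placeholder names):

#include <cstddef>

// c = a + b: two reads and one write per element, with almost no
// arithmetic per byte moved, so the loop saturates DRAM bandwidth
// long before it saturates the cores.
void vector_add(const double* a, const double* b, double* c, std::size_t n)
{
    #pragma omp parallel for
    for (std::size_t i = 0; i < n; i++)
        c[i] = a[i] + b[i];
}

Each iteration moves 24 bytes for a single addition, which is why extra threads mostly end up competing for the same memory bus.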

So the question is whether code like this gives any real performance benefit:

void DenseMatrix::identity()
{
    assert(height == width);
    #pragma omp parallel for if (height > OPENMP_BREAK2)
    for (unsigned int y = 0; y < height; y++)
        for (unsigned int x = 0; x < width; x++)
            elements[y * width + x] = (x == y) ? 1 : 0; // index derived from y and x, so no shared counter is raced across threads
}

In this example there is no serious drawback to using OpenMP. But if I work with sparse vectors and sparse matrices under OpenMP, where for example I cannot use *.push_back(), the problem becomes serious. (The elements of a sparse vector are not contiguous like those of a dense vector, so parallel programming has a drawback here: result elements can arrive at any time, not in order from low to high index.)
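One workaround I am considering is to give each thread a private output buffer, so push_back never needs locking, and to merge the buffers afterwards. This is only a sketch under assumed names (SparseEntry, parallel_collect, and the "is nonzero" test are hypothetical):

#include <cstddef>
#include <omp.h>
#include <vector>

struct SparseEntry { std::size_t index; double value; };

std::vector<SparseEntry> parallel_collect(std::size_t n)
{
    // One private buffer per thread: push_back needs no synchronization.
    std::vector<std::vector<SparseEntry>> local(omp_get_max_threads());
    #pragma omp parallel
    {
        std::vector<SparseEntry>& mine = local[omp_get_thread_num()];
        // schedule(static) gives each thread one contiguous index range,
        // so concatenating the buffers in thread order keeps indices sorted.
        #pragma omp for schedule(static)
        for (std::size_t i = 0; i < n; i++)
            if (i % 3 == 0) // hypothetical "element is nonzero" test
                mine.push_back({i, 1.0});
    }
    std::vector<SparseEntry> result;
    for (std::size_t t = 0; t < local.size(); t++)
        result.insert(result.end(), local[t].begin(), local[t].end());
    return result;
}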


1 Answer


I don't think this is a memory-bandwidth problem. I clearly see a problem with r: it is accessed from multiple threads, which causes both data races and false sharing. False sharing can dramatically hurt performance.

I'm wondering whether you even get the correct answer, given the data races on r. Did you get the correct result?

However, the solution is very simple: the operation performed on r is a reduction, which is easily expressed with OpenMP's reduction clause.

Try simply appending reduction(+ : r) to the #pragma omp parallel directive.
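Since the original example has been replaced, here is my reconstruction of what the fixed dot product might look like (a, b, n, and r are assumed names, not code from the question):

#include <cstddef>

// With reduction(+ : r), each thread accumulates into a private copy
// of r and OpenMP combines the partial sums at the end, so there is
// no data race and no false sharing on r.
double dot(const double* a, const double* b, std::size_t n)
{
    double r = 0.0;
    #pragma omp parallel for reduction(+ : r)
    for (std::size_t i = 0; i < n; i++)
        r += a[i] * b[i];
    return r;
}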

(Note: floating-point addition is not associative, so you may see small precision differences between the parallel and the serial results.)
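For example (a tiny illustration of the rounding effect, with made-up values):

#include <cstdio>

int main()
{
    double big = 1e16, small = 1.0;
    // Evaluated left to right, each + small is rounded away against 1e16;
    // grouping the small terms first preserves their sum.
    std::printf("%.1f\n", (big + small) + small); // prints 10000000000000000.0
    std::printf("%.1f\n", big + (small + small)); // prints 10000000000000002.0
}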

Answered 2013-03-19T17:49:19.863