
First approach (parallelizing the inner loop):

for(j=0; j<LATTICE_VW; ++j) {
    x = j*DX + LATTICE_W;
    #pragma omp parallel for ordered private(y, prob)
        for(i=0; i<LATTICE_VH; ++i) {
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
}

Second approach (parallelizing the outer loop):

#pragma omp parallel for ordered private(i, x, y, prob)
    for(j=0; j<LATTICE_VW; ++j) {
        x = j*DX + LATTICE_W;
        for(i=0; i<LATTICE_VH; ++i) {
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
    }

Third approach (parallelizing the collapsed loops):

#pragma omp parallel for collapse(2) ordered private(x, y, prob)
    for(j=0; j<LATTICE_VW; ++j) {
        for(i=0; i<LATTICE_VH; ++i) {
            x = j*DX + LATTICE_W;
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
    }

If I had to guess, I would say approach 3 should be the fastest.

However, approach 1 is the fastest, while the second and third approaches both take roughly as long as if there were no parallelization at all. Why does this happen?


1 Answer


Take a look at this:

for(int x = 0; x < 4; ++x)
  #pragma omp parallel for ordered
  for(int y = 0; y < 4; ++y)
    #pragma omp ordered
    cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;

you get:

0,0 (by thread 0)
0,1 (by thread 1)
0,2 (by thread 2)
0,3 (by thread 3)
1,0 (by thread 0)
1,1 (by thread 1)
1,2 (by thread 2)
1,3 (by thread 3)

Each thread only has to wait on a few of the couts, so essentially all of the work can be done in parallel. But with either of these:

#pragma omp parallel for ordered
for(int x = 0; x < 4; ++x)
  for(int y = 0; y < 4; ++y)
    #pragma omp ordered
    cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;

#pragma omp parallel for collapse(2) ordered
for(int x = 0; x < 4; ++x)
  for(int y = 0; y < 4; ++y)
    #pragma omp ordered
    cout << x << ',' << y << " (by thread " << omp_get_thread_num() << ')' << endl;

the situation is:

0,0 (by thread 0)
0,1 (by thread 0)
0,2 (by thread 0)
0,3 (by thread 0)
1,0 (by thread 1)
1,1 (by thread 1)
1,2 (by thread 1)
1,3 (by thread 1)
2,0 (by thread 2)
2,1 (by thread 2)
2,2 (by thread 2)
2,3 (by thread 2)
3,0 (by thread 3)
3,1 (by thread 3)
3,2 (by thread 3)
3,3 (by thread 3)

So thread 1 has to wait for thread 0 to finish all of its work before it can cout for the first time, and almost nothing can be done in parallel.

Try adding schedule(static,1) to the collapse version; it should then be at least as fast as the first version.
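
For reference, here is a minimal sketch of that suggestion applied to the third approach from the question (it assumes the same surrounding declarations: i, j, x, y, prob, psi, out, and the LATTICE_*/DX/DY constants):

#pragma omp parallel for collapse(2) ordered schedule(static,1) private(x, y, prob)
    for(j=0; j<LATTICE_VW; ++j) {
        for(i=0; i<LATTICE_VH; ++i) {
            x = j*DX + LATTICE_W;
            y = i*DY + LATTICE_S;
            prob = psi[i][j].norm();

            // With chunk size 1 the collapsed iterations are dealt out
            // round-robin, so each thread waits only for the output of the
            // single preceding iteration instead of a whole block of them.
            #pragma omp ordered
                out << x << " " << y << " " << prob << endl;
        }
    }

With four threads, the ordered output then cycles through the threads exactly like the first listing above: the writes stay in sequential order, but the psi[i][j].norm() computations overlap.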

Answered 2013-01-27T00:09:24.323