c - OpenMP is slower than a single thread, can't figure out why

Question

I have a problem: the following code runs slower when I use OpenMP:

chunk = nx/nthreads;
int i, j;
for(int t = 0; t < n; t++){
     #pragma omp parallel for default(shared) private(i, j) schedule(static,chunk) 
     for(i = 1; i < nx/2+1; i++){
        for(j = 1; j < nx-1; j++){
            T_c[i][j] =0.25*(T_p[i-1][j] +T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
            T_c[nx-i+1][j] = T_c[i][j];
        }
    }
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);

The problem is that the computation takes much longer when I run it with multiple threads.

score 2 · Accepted Answer

First, your parallel region is restarted on every iteration of the outer loop, which adds a huge overhead.

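Second, half of the threads will just sit there doing nothing, because your chunk size is twice as large as it should be. The chunk size is nx/nthreads, while the parallel loop has only nx/2 iterations, so there are (nx/2)/(nx/nthreads) = nthreads/2 chunks in total. Besides, what you are trying to achieve is just a replica of schedule(static).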
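The chunk arithmetic above can be checked with a small sketch. The helper names count_chunks and busy_threads are hypothetical and not part of the original code; they only model how schedule(static,chunk) hands out chunks round-robin, assuming the loop has nx/2 iterations.

```c
/* Number of chunks that schedule(static, chunk) produces for a loop
 * with `iterations` iterations: ceil(iterations / chunk). */
int count_chunks(int iterations, int chunk)
{
    return (iterations + chunk - 1) / chunk;
}

/* Chunks are dealt out round-robin, so the number of threads that get
 * any work is the smaller of the chunk count and the thread count. */
int busy_threads(int iterations, int chunk, int nthreads)
{
    int chunks = count_chunks(iterations, chunk);
    return chunks < nthreads ? chunks : nthreads;
}
```

For example, with nx = 1024 and nthreads = 8 the chunk size is 128 and the loop has 512 iterations, so busy_threads(512, 128, 8) yields 4: only half of the threads receive a chunk.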

#pragma omp parallel
for (int t = 0; t < n; t++) {
   #pragma omp for schedule(static) 
   for (int i = 1; i < nx/2+1; i++) {
      for (int j = 1; j < nx-1; j++) {
         T_c[i][j] = 0.25*(T_p[i-1][j]+T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
         T_c[nx-i-1][j] = T_c[i][j];
      }
   }
   #pragma omp single
   copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);

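If you modify copyT to also use a parallel for, the single construct should be removed. You do not need default(shared), since that is the default. You do not need to declare the loop variable of a parallel loop private: even if that variable comes from an outer scope (and is therefore implicitly shared in the region), OpenMP automatically makes it private. Simply declare all loop variables in the loop controls and the default sharing rules will handle them automatically.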
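A minimal sketch of what a work-shared copy might look like. The original copyT is not shown, so the name copy_grid, its signature (double ** rows, destination first) and the nx-by-nx extent are all assumptions. Because the function is called from inside an already open parallel region, the sketch uses an orphaned omp for rather than opening a new region.

```c
/* Hypothetical stand-in for copyT: copies an nx-by-nx grid row by row.
 * The orphaned "omp for" binds to the enclosing parallel region, so the
 * rows are divided among the threads that are already running. Called
 * outside a parallel region, the loop simply runs on one thread. */
void copy_grid(double **dst, double **src, int nx)
{
    #pragma omp for schedule(static)
    for (int i = 0; i < nx; i++) {
        for (int j = 0; j < nx; j++) {
            dst[i][j] = src[i][j];
        }
    }
    /* The implicit barrier at the end of "omp for" ensures the copy is
     * complete before any thread starts the next sweep. */
}
```

With this variant every thread in the team has to call copy_grid, which is why the single construct would no longer be appropriate around the call.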

Second and a half, there is (possibly) a bug in your inner loop. The second assignment statement should be:

T_c[nx-i-1][j] = T_c[i][j];

(or T_c[nx-i][j] if you do not keep a halo on the lower side). Otherwise, when i equals 1, you would access T_c[nx][...], which is outside the bounds of T_c.
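The index arithmetic can be made explicit with a short sketch, assuming T_c has exactly nx rows indexed 0 to nx-1, with rows 0 and nx-1 acting as boundary rows. The helper names mirror_row and row_in_bounds are hypothetical.

```c
/* Row mirrored across the horizontal midline of a grid whose rows are
 * indexed 0 .. nx-1 (row 0 maps to row nx-1, row 1 to row nx-2, ...). */
int mirror_row(int i, int nx)
{
    return nx - 1 - i;
}

/* Returns 1 if row index r is valid for a grid with nx rows, else 0. */
int row_in_bounds(int r, int nx)
{
    return r >= 0 && r < nx;
}
```

With nx = 8 the loop runs i = 1..4, and mirror_row maps these to rows 6, 5, 4 and 3, all inside the grid. The original expression nx-i+1 gives row 8 for i = 1, which row_in_bounds rejects.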

Third, a general tip: instead of copying one array into the other, use pointers to those arrays and swap the two pointers at the end of each iteration.
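A minimal sketch of the pointer-swap idea, under several assumptions: the grids are double ** arrays of nx rows, both buffers start with identical boundary values, and the sweep covers the whole interior instead of using the half-and-mirror trick from the question. The function name jacobi_swap is hypothetical.

```c
/* Runs n Jacobi sweeps on an nx-by-nx grid and returns the buffer that
 * holds the latest result. No copy is made between sweeps: the roles of
 * the two buffers are exchanged by swapping pointers instead.
 * Assumes T_p and T_c hold the same boundary values on entry. */
double **jacobi_swap(double **T_p, double **T_c, int nx, int n)
{
    #pragma omp parallel
    {
        /* Each thread keeps its own copies of the two pointers, so the
         * swap itself needs no synchronization. */
        double **prev = T_p;
        double **curr = T_c;

        for (int t = 0; t < n; t++) {
            #pragma omp for schedule(static)
            for (int i = 1; i < nx - 1; i++) {
                for (int j = 1; j < nx - 1; j++) {
                    curr[i][j] = 0.25 * (prev[i-1][j] + prev[i+1][j]
                                       + prev[i][j-1] + prev[i][j+1]);
                }
            }
            /* The implicit barrier after "omp for" guarantees the sweep
             * is finished before any thread reads curr as prev. */
            double **tmp = prev;
            prev = curr;
            curr = tmp;
        }
    }
    /* After an odd number of sweeps the result is in T_c, otherwise T_p. */
    return (n % 2 == 0) ? T_p : T_c;
}
```

The caller would pass the returned pointer to the output routine rather than assuming the result is always in T_c.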

score 1 · Accepted Answer

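I see at least three issues in the snippet you posted that could lead to poor performance:

The chunk size is too small to show any gain when the work is divided among the threads.
Opening and closing the parallel region inside the loop can hurt performance.
The two innermost loops appear to be independent, yet you parallelize only one of them (losing the chance to exploit a wider iteration space).

Below are some modifications I would make to the code: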

// Moving the omp parallel outside the time loop opens and closes
// the parallel region only once, not n times
#pragma omp parallel default(shared)
for(int t = 0; t < n; t++){
     // With collapse you parallelize over an iteration space of
     // (nx/2)*(nx-2) iterations, not only the nx/2 iterations of the i loop
     #pragma omp for collapse(2) schedule(static)
     for(int i = 1; i < nx/2+1; i++){
        for(int j = 1; j < nx-1; j++){
            T_c[i][j] =0.25*(T_p[i-1][j] +T_p[i+1][j]+T_p[i][j-1]+T_p[i][j+1]);
            T_c[nx-i+1][j] = T_c[i][j];
        }
    }
    // As the iteration space is very small and the work done
    // at each iteration is not much, static schedule will likely be the best option
    // as it is the one that adds the least overhead for scheduling

    // Assuming copyT is serial, only one thread should run it inside the
    // parallel region; the implicit barrier after single holds the others back
    #pragma omp single
    copyT(T_p, T_c, nx);
}
print2file(T_c, nx, file);
