c++ - C++ OpenMP much slower than serial implementation

Question

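I am running a thermodynamic simulation on a two-dimensional array. The array is 1024x1024. The while loop iterates a specified number of times or until goodTempChange is false. goodTempChange is set to true or false depending on whether the temperature change of a block is greater than a defined EPSILON value. If every block in the array is below that value, the plate is in stasis. The program works and I have no problems with the code; my problem is that the serial code absolutely blows the OpenMP code out of the water. I don't know why. I have tried removing everything except the average calculation, which is just the average of the 4 blocks above, below, left and right of the square in question, and it still gets destroyed by the serial code. I've never done OpenMP before, and I looked some things up online to write what I have. I have the variables inside critical regions in the most efficient way I could see, and I have no race conditions. I really don't see what is wrong. Any help would be greatly appreciated. Thanks.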

while(iterationCounter < DESIRED_ITERATIONS && goodTempChange) {
  goodTempChange = false;
  if((iterationCounter % 1000 == 0) && (iterationCounter != 0)) {
    cout << "Iteration Count      Highest Change    Center Plate Temperature" << endl;
    cout << "-----------------------------------------------------------" << endl;
    cout << iterationCounter << "               "
         << highestChange << "            " << newTemperature[MID][MID] << endl;
    cout << endl;
  }

  highestChange = 0;

  if(iterationCounter != 0)
    memcpy(oldTemperature, newTemperature, sizeof(oldTemperature));

  for(int i = 1; i < MAX-1; i++) {  
  #pragma omp parallel for schedule(static) 
    for(int j = 1; j < MAX-1; j++) {
      bool tempGoodChange = false;
      double tempHighestChange = 0;
      newTemperature[i][j] = (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
                              oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4;

      if((iterationCounter + 1) % 1000 == 0) {
        if(abs(oldTemperature[i][j] - newTemperature[i][j]) > highestChange)
          tempHighestChange = abs(oldTemperature[i][j] - newTemperature[i][j]);
        if(tempHighestChange > highestChange) {
          #pragma omp critical
          {
            if(tempHighestChange > highestChange)
              highestChange = tempHighestChange;
          }
        }
      }
      if(abs(oldTemperature[i][j] - newTemperature[i][j]) > EPSILON
         && !tempGoodChange)
        tempGoodChange = true;

      if(tempGoodChange && !goodTempChange) {
        #pragma omp critical
        {
          if(tempGoodChange && !goodTempChange)
            goodTempChange = true;
        }
      }
    }
  }
  iterationCounter++;
}

score 1 · Accepted Answer

Trying to get rid of those critical sections may help. For example:

#pragma omp critical
{
  if(tempHighestChange > highestChange)
  {
    highestChange = tempHighestChange;
  }
}

Here, each thread can store the highestChange it computes in a local variable, and when the parallel section finishes you take the maximum of those per-thread values.
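One way to get this per-thread maximum without writing the bookkeeping by hand is an OpenMP `reduction(max : ...)` clause (available since OpenMP 3.1). The following is only a minimal sketch under some assumptions: the helper name `sweep_max_change` and the flat row-major `std::vector` storage are hypothetical and not taken from the question, which uses 2D arrays.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper (not from the question): performs one averaging sweep
// over the interior of an n x n grid stored row-major, and returns the
// largest absolute change of any interior cell.
double sweep_max_change(const std::vector<double>& oldT,
                        std::vector<double>& newT, int n) {
  assert(n >= 3);
  const std::size_t N = static_cast<std::size_t>(n);
  assert(oldT.size() == N * N && newT.size() == N * N);

  double highestChange = 0.0;
  // Every thread works on a private copy of highestChange; OpenMP combines
  // the private copies with max when the loop ends, so no critical section
  // is needed.
  #pragma omp parallel for reduction(max : highestChange)
  for (int i = 1; i < n - 1; ++i) {
    for (int j = 1; j < n - 1; ++j) {
      const std::size_t k = i * N + j;
      newT[k] = 0.25 * (oldT[k - N] + oldT[k + N] + oldT[k - 1] + oldT[k + 1]);
      highestChange = std::max(highestChange, std::abs(oldT[k] - newT[k]));
    }
  }
  return highestChange;
}
```

The convergence flag can then be derived from the returned value (for example, by comparing it with EPSILON), which would remove the second critical section as well.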

score 0 · Accepted Answer

Here is my attempt (untested).

double**newTemperature;
double**oldTemperature;

while(iterationCounter < DESIRED_ITERATIONS && goodTempChange) {
  if((iterationCounter % 1000 == 0) && (iterationCounter != 0))
    std::cout
      << "Iteration Count      Highest Change    Center Plate Temperature\n"
      << "---------------------------------------------------------------\n" 
      << iterationCounter << "               "
      << highestChange << "            "
      << newTemperature[MID][MID] << '\n' << std::endl;

  goodTempChange = false;
  highestChange  = 0;

  // swap pointers to arrays (but not the arrays themselves!)
  if(iterationCounter != 0)
    std::swap(newTemperature,oldTemperature);

  bool CheckTempChange = (iterationCounter + 1) % 1000 == 0;
#pragma omp parallel
  {
    bool localGoodChange = false;
    double localHighestChange = 0;
#pragma omp for
    for(int i = 1; i < MAX-1; i++) {
      //
      // note that putting a second
      // #pragma omp parallel for
      // here has (usually) zero effect. this is called nested parallelism and
      // is usually disabled by default, thus the new nested team of threads
      // has only one thread.
      //
      for(int j = 1; j < MAX-1; j++) {
        newTemperature[i][j] = 0.25 *   // multiply is faster than divide
          (oldTemperature[i-1][j] + oldTemperature[i+1][j] +
           oldTemperature[i][j-1] + oldTemperature[i][j+1]);
        if(CheckTempChange)
          localHighestChange =
            std::max(localHighestChange,
                     std::abs(oldTemperature[i][j] - newTemperature[i][j]));
        localGoodChange = localGoodChange ||
          std::abs(oldTemperature[i][j] - newTemperature[i][j]) > EPSILON;
        // shouldn't this be < EPSILON? in the previous line?
      }
    }
    //
    // note that we have moved the critical sections out of the loops to
    // avoid any potential issues with contentions (on the mutex used to
    // implement the critical section). Also note that I named the sections,
    // allowing simultaneous update of goodTempChange and highestChange
    //
    if(!goodTempChange && localGoodChange)
#pragma omp critical(TempChangeGood)
      goodTempChange = true;
    if(CheckTempChange && localHighestChange > highestChange)
#pragma omp critical(TempChangeHighest)
      highestChange = std::max(highestChange,localHighestChange);
  }
  iterationCounter++;
}

There are several changes from your original:

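The outer rather than the inner of the nested for loops is executed in parallel. This should make a significant difference. Added in edit: from the comments it seems you don't see the significance of this, so let me explain. In your original code, the outer loop (over i) was done by the master thread only. For each i, a team of threads was created to perform the inner loop over j in parallel. This incurs the overhead of creating and synchronising the team at every i! If you parallelise the outer loop over i instead, this overhead is encountered only once, and each thread runs the entire inner loop over j for every i it is given. Always parallelising the outermost loop possible is therefore basic wisdom for multi-threaded coding.
The double for loop sits inside a parallel region, which reduces the critical-region calls to one per thread per while-loop iteration. You may also consider putting the whole while loop inside a parallel region.
I also swap between the two arrays (similar to the suggestion in another answer) to avoid the memcpy, but this shouldn't really be performance critical. Added in edit: std::swap(newTemperature,oldTemperature) only swaps the pointer values and not the memory pointed to; of course, that is the point.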

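Finally, don't forget that the proof of the pudding is in the eating: just try what difference it makes to put the #pragma omp for in front of the inner or the outer loop. Always do experiments like this before asking on SO; otherwise you may be accused of not having done enough research.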
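Such an experiment could look roughly like the sketch below, which times the same sweep with the parallel directive on the inner loop and on the outer loop. It is an illustration only: the names `Grid`, `avg4`, `sweep_inner_parallel`, `sweep_outer_parallel` and `time_sweeps` are hypothetical, and the grid is assumed to be stored row-major in a `std::vector` rather than in the 2D arrays of the question.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>
#include <vector>

using Grid = std::vector<double>;  // n x n plate, stored row-major

// Average of the four neighbours of cell k in a grid with row length n.
inline double avg4(const Grid& t, std::size_t k, std::size_t n) {
  return 0.25 * (t[k - n] + t[k + n] + t[k - 1] + t[k + 1]);
}

// Variant A: a thread team is forked and joined once per row,
// as in the code from the question.
void sweep_inner_parallel(const Grid& oldT, Grid& newT, int n) {
  assert(n >= 3);
  const std::size_t N = static_cast<std::size_t>(n);
  for (int i = 1; i < n - 1; ++i) {
    #pragma omp parallel for schedule(static)
    for (int j = 1; j < n - 1; ++j)
      newT[i * N + j] = avg4(oldT, i * N + j, N);
  }
}

// Variant B: a single parallel loop over rows; each thread runs the
// whole inner loop for the rows assigned to it.
void sweep_outer_parallel(const Grid& oldT, Grid& newT, int n) {
  assert(n >= 3);
  const std::size_t N = static_cast<std::size_t>(n);
  #pragma omp parallel for schedule(static)
  for (int i = 1; i < n - 1; ++i) {
    for (int j = 1; j < n - 1; ++j)
      newT[i * N + j] = avg4(oldT, i * N + j, N);
  }
}

// Runs `reps` sweeps with the given variant and returns the elapsed
// wall-clock time in seconds. The grids are taken by value so that each
// measurement starts from the same data.
double time_sweeps(void (*sweep)(const Grid&, Grid&, int),
                   Grid a, Grid b, int n, int reps) {
  const auto t0 = std::chrono::steady_clock::now();
  for (int r = 0; r < reps; ++r) {
    sweep(a, b, n);
    a.swap(b);  // exchanges the buffers without copying their contents
  }
  const auto t1 = std::chrono::steady_clock::now();
  return std::chrono::duration<double>(t1 - t0).count();
}
```

Calling `time_sweeps` once with each variant on the same 1024x1024 grid, in a build with OpenMP enabled (for example `-fopenmp` with GCC or Clang), gives two numbers to compare. The actual timings depend on the machine and thread count, so none are quoted here.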

score -1 · Accepted Answer

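I assume you are concerned with the time taken by the entire code inside the while loop, not just the time taken by the loop beginning for(int i = 1; i < MAX-1; i++).

This operation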

if(iterationCounter != 0)
{
    memcpy(oldTemperature, newTemperature, sizeof(oldTemperature));
}

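is unnecessary and, for large arrays, may be enough to kill performance. Instead of maintaining 2 arrays, old and new, maintain one 3D array with two planes. Create two integer variables, let's call them oldIdx and newIdx (new on its own is a reserved word in C++), and set them initially to 0 and 1. Replace

newTemperature[i][j] = ((oldTemperature[i-1][j] +  oldTemperature[i+1][j] + oldTemperature[i][j-1] + oldTemperature[i][j+1]) / 4);

by

temperature[newIdx][i][j] = 
  (temperature[oldIdx][i-1][j] +
   temperature[oldIdx][i+1][j] +
   temperature[oldIdx][i][j-1] +
   temperature[oldIdx][i][j+1])/4;

and, at the end of each update, swap the values of oldIdx and newIdx so that the next update goes the other way round. I'll leave it to you to decide whether the old/new index should be the first or the last index of the array. This approach eliminates the need to move (large amounts of) data around in memory.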
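The two-plane idea can be sketched as follows. This is only an illustration under assumptions: the `Plate` struct, its member names and the flat `std::vector` storage per plane are hypothetical choices, with the plane index placed first.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical two-plane plate: plane[oldIdx] is read, plane[newIdx] is written.
struct Plate {
  int n;                                     // the plate is n x n
  std::array<std::vector<double>, 2> plane;  // two planes, each row-major
  int oldIdx = 0;
  int newIdx = 1;

  explicit Plate(int size) : n(size) {
    assert(size >= 3);
    const std::size_t cells = static_cast<std::size_t>(size) * size;
    plane[0].assign(cells, 0.0);
    plane[1].assign(cells, 0.0);
  }

  double& at(int p, int i, int j) {
    return plane[p][static_cast<std::size_t>(i) * n + j];
  }

  // Boundary cells are never updated by step(), so they must hold the same
  // value in both planes.
  void set_boundary(int i, int j, double value) {
    at(0, i, j) = value;
    at(1, i, j) = value;
  }

  // One update of all interior cells, followed by an index swap in place of
  // the memcpy.
  void step() {
    #pragma omp parallel for
    for (int i = 1; i < n - 1; ++i) {
      for (int j = 1; j < n - 1; ++j) {
        at(newIdx, i, j) = 0.25 * (at(oldIdx, i - 1, j) + at(oldIdx, i + 1, j) +
                                   at(oldIdx, i, j - 1) + at(oldIdx, i, j + 1));
      }
    }
    std::swap(oldIdx, newIdx);  // swaps two ints; no grid data is moved
  }
};
```

After `step()` returns, the most recent temperatures are in `plane[oldIdx]`, because the plane that was just written becomes the input of the next update.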

This SO question and answer covers another possible cause of severe slowdown or failure to speed up. Whenever I see arrays with sizes of 2^n, I suspect cache issues.
