gpgpu - C++ AMP (GPU): are global writes not propagating fast enough for all tiles to see?

Question

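I think this is a GPU issue rather than a C++ AMP issue, so I have tagged it broadly.

I have an implementation of a computation that splits the work into many tiles, each of which does its work and then adds its result to an existing value in global memory. First, every thread in a tile computes its partial result into tile_static memory, where each thread has its own index to write to. Later, the first thread in the tile sums all the partial results and adds the total to a location in global memory.

Tiles (thread 0 in each tile) will sometimes want to write to the same location, so I added simple locking.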

inline void lock(int *lockVariable) restrict(amp)
{
    while (atomic_exchange(lockVariable, 1) != 0);
}

inline void unlock(int *lockVariable) restrict(amp)
{
    *lockVariable = 0;
}

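The lock variable I pass to lock and unlock lives in a global array of integers, with one integer per contended memory location that tiles will write to.

The actual write of the tile result, done by the first thread in the tile, looks like this:

//now the FIRST thread in the tile will sum all the pulls into one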
if (idx.local[0] == 0)
{                   
  double_4 tileAcceleration = 0;
  for (int i = 0; i < idx.tile_dim0; i++)
  {
    tileAcceleration += threadAccelerations[i];
  }
  lock(&locks[j]);
  //now the FIRST thread in the tile will add this to the global result
  acceleration[j] += tileAcceleration;
  unlock(&locks[j]);
}

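This works most of the time, but not always. There must be a race condition somewhere, because when there are too many tiles relative to the number of memory locations being written (too much contention for the locks), it sometimes fails to add a tile result correctly.

It seems that sometimes, though rarely, the lock/unlock pair does not ensure a correct addition.

This can be "fixed" by moving the lock up in front of the summation, so that more time passes between acquiring the lock and thread 0 doing the actual write. I can also "fix" it by taking the lock when five elements are left in the sum. Both are shown below.

First fix, which is slow (blocks for too long):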

if (idx.local[0] == 0)
{                   
  lock(&locks[j]); //get lock right away
  double_4 tileAcceleration = 0;
  for (int i = 0; i < idx.tile_dim0; i++)
  {
    tileAcceleration += threadAccelerations[i];
  }
  //now the FIRST thread in the tile will add this to the global result
  acceleration[j] += tileAcceleration;
  unlock(&locks[j]);
}

Second fix, which is faster:

if (idx.local[0] == 0)
{                   
  //lock(&locks[j]); //this is a "fix" but a slow one
  double_4 tileAcceleration = 0;
  for (int i = 0; i < idx.tile_dim0; i++)
  {
    tileAcceleration += threadAccelerations[i];
    if (i == idx.tile_dim0 - 5) lock(&locks[j]); //lock when almost done
  }
  //now the FIRST thread in the tile will add this to the global result
  acceleration[j] += tileAcceleration;
  unlock(&locks[j]);
}

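Seeing how these "fixes" behave, it seems clear that some memory writes are not being propagated system-wide fast enough. One tile can lock a location, write to it, and unlock. Another tile then acquires the lock, does its addition (but against the old, not-yet-updated data), and unlocks.

The lock is an int and the data is a double_4, so it looks as if the lock release propagates fast enough for other tiles to see it while the data is still in transit. Another tile can see that the lock is free even though the first tile's write has not been fully committed. The second tile therefore reads the stale data value from cache, adds to it, and writes it back...

Can someone help me understand why the data is not invalidated (in the cache) when the first tile writes, and help me find a proper solution to this problem?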

score 2 · Accepted Answer

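In short, what you are doing here is not a good way to solve the problem. First, atomic operations in C++ AMP have the following restrictions:

You should not mix atomic and normal (non-atomic) reads and writes. Normal reads may not see the results of atomic writes to the same memory location. Normal writes should not be mixed with atomic writes to the same memory location. If your program does not comply with these criteria, this will lead to an undefined result.

Atomic operations do not imply a memory fence of any sort. Atomic operations may be reordered. This differs from the behavior of interlocked operations in C++.

So for your lock function to work, the unlock function also needs to use an atomic operation rather than the plain write it uses now.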
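For comparison, here is a minimal host-side sketch in standard C++ (not C++ AMP) of the ordering guarantees a working spin lock relies on. The names SpinLock, isLocked, and guardedAdd are made up for illustration. std::atomic cannot be used inside restrict(amp) code, and per the restrictions quoted above C++ AMP atomics provide no equivalent fence, which is why the pattern does not carry over to the kernel as written.

```cpp
#include <atomic>

// Host-side analog only: std::atomic is standard C++ and is NOT usable in
// restrict(amp) code. All names here are hypothetical.
class SpinLock {
public:
    void lock() {
        // The exchange is atomic, and the acquire ordering keeps the
        // protected reads/writes from being moved before the lock is taken.
        while (flag_.exchange(1, std::memory_order_acquire) != 0) {
            // spin until the previous owner releases the lock
        }
    }

    void unlock() {
        // An atomic store with release ordering: every write made while
        // holding the lock becomes visible before the lock appears free.
        flag_.store(0, std::memory_order_release);
    }

    bool isLocked() const {
        return flag_.load(std::memory_order_relaxed) != 0;
    }

private:
    std::atomic<int> flag_{0};
};

// Mirrors the question's "lock, add, unlock" sequence on the host.
inline void guardedAdd(SpinLock& lock, double& target, double value) {
    lock.lock();
    target += value;
    lock.unlock();
}
```

The two memory orderings are the part the question's lock/unlock pair has no way to express: without them, the release of the lock and the data write are free to become visible in either order.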

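In general, you should not try to lock in this way, because it is extremely inefficient. Your program can synchronize operations between threads in the same tile using the tile barrier primitives. Operations across tiles are only guaranteed to be synchronized when the kernel completes.

It looks like what you are trying to do here is some sort of reduction/accumulation operation. Each thread generates a result, and all of these results are then combined to create a single (final) result.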
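One lock-free way to structure such an accumulation is a two-pass scheme: in the first pass every tile writes its partial sum to a slot of its own, and in a second pass one thread per target location adds up the slots that belong to it. The sketch below emulates the second pass on the host in plain C++. TilePartial and combinePartials are hypothetical names, and the outer loop stands in for what would be a separate kernel launch on the GPU.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical record: the partial sum that one tile produced for the
// global location with index `target`.
struct TilePartial {
    std::size_t target;
    double value;
};

// Host-side emulation of the second pass. Each iteration of the outer loop
// plays the role of one GPU thread that owns result[j], so every output
// element has exactly one writer and no lock is needed.
std::vector<double> combinePartials(const std::vector<TilePartial>& partials,
                                    std::size_t targetCount)
{
    std::vector<double> result(targetCount, 0.0);
    for (std::size_t j = 0; j < targetCount; ++j) {
        double sum = 0.0;
        for (const TilePartial& p : partials) {
            assert(p.target < targetCount);
            if (p.target == j)
                sum += p.value;
        }
        result[j] = sum;
    }
    return result;
}
```

The design relies on the first pass having finished before the second pass starts, so that all partial sums are visible without any cross-tile locking.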

Here is a simple reduction example.

#include <vector>
#include <algorithm>
#include <numeric>
#include <amp.h>

using namespace concurrency;

int Reduce(accelerator_view& view, const std::vector<int>& source)
{
    const int windowWidth = 8;
    int elementCount = static_cast<int>(source.size());

    // Using array as temporary memory.
    array<int, 1> a(elementCount, source.cbegin(), source.cend(), view);

    // Takes care of the sum of tail elements.
    int tailSum = 0;
    if ((elementCount % windowWidth) != 0 && elementCount > windowWidth)
        tailSum = 
            std::accumulate(source.begin() + ((elementCount - 1) / windowWidth) * windowWidth, 
                source.end(), 0);

    array_view<int, 1> avTailSum(1, &tailSum);

    // Each thread reduces windowWidth elements.
    int prevStride = elementCount;
    for (int stride = (elementCount / windowWidth); stride > 0; stride /= windowWidth)
    {
        parallel_for_each(view, extent<1>(stride), [=, &a] (index<1> idx) restrict(amp)
        {
            int sum = 0;
            for (int i = 0; i < windowWidth; i++)
                sum += a[idx + i * stride];
            a[idx] = sum;

            // Reduce the tail in cases where the number of elements is not divisible.
            // Note: execution of this section may negatively affect the performance.
            // In production code the problem size passed to the reduction should
            // be a power of the windowWidth. 
            if ((idx[0] == (stride - 1)) && ((stride % windowWidth) != 0) && (stride > windowWidth))
            {
                for(int i = ((stride - 1) / windowWidth) * windowWidth; i < stride; i++)
                    avTailSum[0] += a[i];
            }
        });
        prevStride = stride;
    }

    // Perform any remaining reduction on the CPU.
    std::vector<int> partialResult(prevStride);
    copy(a.section(0, prevStride), partialResult.begin());
    avTailSum.synchronize();
    return std::accumulate(partialResult.begin(), partialResult.end(), tailSum);
}
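To make the stride pattern easier to follow, here is a host-only sketch in plain C++ that performs the same windowed passes sequentially. It is a simplified variant, not the book's code: the name reduceStrided is made up, and leftover tail elements are folded in on the host before each pass instead of inside the kernel.

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Sequential emulation of the strided reduction. Each iteration of the
// `idx` loop corresponds to one GPU thread in a parallel_for_each pass.
// `a` is taken by value because it is used as scratch memory.
int reduceStrided(std::vector<int> a, int windowWidth)
{
    assert(windowWidth > 1);  // a width of 1 would never shrink the stride
    int tailSum = 0;
    int prevStride = static_cast<int>(a.size());
    for (int stride = prevStride / windowWidth; stride > 0; stride /= windowWidth)
    {
        // Elements that do not fit into stride * windowWidth form the tail.
        for (int i = stride * windowWidth; i < prevStride; ++i)
            tailSum += a[i];

        // "Thread" idx sums windowWidth elements spaced stride apart and
        // writes only a[idx], so the passes never race with each other.
        for (int idx = 0; idx < stride; ++idx)
        {
            int sum = 0;
            for (int i = 0; i < windowWidth; ++i)
                sum += a[idx + i * stride];
            a[idx] = sum;
        }
        prevStride = stride;
    }
    // Fewer than windowWidth partial sums remain; finish sequentially.
    return std::accumulate(a.begin(), a.begin() + prevStride, tailSum);
}
```

With windowWidth set to 8 and 64 input elements, the first pass leaves 8 partial sums and the second pass leaves one, so no tail handling is needed. That is the case the comment in the kernel recommends for production code.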

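In general, if your parallel code relies on locks or other explicit synchronization primitives, you should ask whether this is really the right approach. If you can explain more about what you are trying to achieve, I may be able to provide a more specific answer.

The text and example above are taken from The C++ AMP Book.

By the way: your code refers to tileAcceleration. If you are implementing some sort of n-body model, you can find a complete implementation in the C++ AMP Book CodePlex project.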
