I think you may simply need to learn more about CUDA. You may be falling into the trap of assuming that a programming paradigm you learned previously should be applied here; I'm not sure that's the case.
But to answer your question, first let me point out that thread synchronization in CUDA is only possible within a threadblock. So my comments only apply there.
The principal sync mechanism in device code is __syncthreads(). To use it roughly along the lines you describe, I could code something like this:
__syncthreads();
if (threadIdx.x < 100){
    // code in this block will only be executed by threads 0-99, all others do nothing
}
__syncthreads();
if ((threadIdx.x > 99) && (threadIdx.x < 200)){
    // code in this block will only be executed by threads 100-199, all others do nothing
}
// all threads can begin executing at this point
Note that even the threads in a threadblock are not all executing in lockstep. The SM (the threadblock processing unit in a CUDA GPU) generally breaks a threadblock into groups of 32 threads called warps, and it is these warps that actually execute (more or less) in lockstep. However, the code listed above still has the effect I describe, namely sequencing execution amongst groups of threads, if you wanted to do that for some reason. One related caveat: __syncthreads() must be reached by every thread in the block, which is why the barriers above sit outside the if statements rather than inside them.
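If it helps to see the pattern end to end, here is a minimal, self-contained sketch. The kernel name, block size, and the producer/consumer payload (the first group of threads fills a shared buffer, the second group reads it) are just assumptions for illustration, not anything taken from your code:

#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel illustrating the sequencing pattern above:
// threads 0-99 produce values, threads 100-199 consume them.
__global__ void sequencedGroups(int *out)
{
    __shared__ int buf[100];

    __syncthreads();                      // all threads in the block line up here
    if (threadIdx.x < 100) {
        buf[threadIdx.x] = threadIdx.x * 2;   // "phase 1" work, threads 0-99 only
    }
    __syncthreads();                      // guarantees phase 1 is finished before phase 2 begins
    if ((threadIdx.x > 99) && (threadIdx.x < 200)) {
        // "phase 2" work, threads 100-199 only; safely reads what phase 1 wrote
        out[threadIdx.x - 100] = buf[threadIdx.x - 100] + 1;
    }
    // all threads continue from here
}

int main()
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, 100 * sizeof(int));

    sequencedGroups<<<1, 256>>>(d_out);   // one block of 256 threads (need at least 200)
    cudaDeviceSynchronize();

    int h_out[100];
    cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);
    printf("out[0] = %d, out[99] = %d\n", h_out[0], h_out[99]);  // expect 1 and 199

    cudaFree(d_out);
    return 0;
}

Because the barrier between the two if blocks is hit by every thread in the block, the second group cannot start reading buf until the first group has finished writing it; that is the whole point of placing the barriers outside the conditionals.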