parallel-processing - Does exiting threads early break synchronization among CUDA threads in a block?

Question

I am implementing an image processing algorithm in CUDA, and I have some questions about thread synchronization in general.

The problem at hand can be explained like this:

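We have an image of size W*H. For each pixel of the image I need to run 9 identical, data-parallel processes, and each process produces an array of values as its result (the arrays have the same length for the whole algorithm, say N, which is around 20 or 30). For each pixel, these 9 processes accumulate their results into a final array (a single array per pixel) once they finish their calculations.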

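To parallelize this, I designed the following structure: I launch blocks with dimensions (10,10,9), which means each thread block processes a 10*10 sub-image and each thread handles 1 of the 9 identical processes for a single pixel. The grid dimensions are then (W/10,H/10,1). For each thread block I allocate a shared memory array of length 100*N, and each thread writes to the appropriate shared memory location according to the coordinates of its current pixel. So I need to synchronize here with atomicAdd and __syncthreads().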
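The launch geometry described above can be sketched as follows. This is only an illustration under stated assumptions: the kernel name `processImage`, the `inBounds` counter, the example image size, and the choice of N = 22 are hypothetical, and the grid is rounded up (rather than using W/10 directly) so that images whose sides are not multiples of 10 are still covered. The shared array of 100*N floats is passed as dynamic shared memory at launch.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

constexpr int TILE  = 10;  // each block covers a 10x10 sub-image
constexpr int PROCS = 9;   // 9 processes per pixel, mapped to threadIdx.z

// Hypothetical placeholder kernel: it only clears the per-block accumulator
// and counts the threads that map to a real pixel.
__global__ void processImage(int width, int height, int n, unsigned int* inBounds)
{
    extern __shared__ float acc[];  // TILE*TILE*n floats, sized at launch

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int pixelInBlock = threadIdx.y * blockDim.x + threadIdx.x;

    // One thread per pixel (z == 0) clears that pixel's n accumulator slots.
    if (threadIdx.z == 0)
        for (int i = 0; i < n; ++i)
            acc[n * pixelInBlock + i] = 0.0f;
    __syncthreads();  // reached by every thread of the block

    if (x < width && y < height)
        atomicAdd(inBounds, 1u);
}

int main()
{
    const int width = 25, height = 13, n = 22;  // assumed example values

    unsigned int* dCount = nullptr;
    cudaMalloc(&dCount, sizeof(unsigned int));
    cudaMemset(dCount, 0, sizeof(unsigned int));

    dim3 block(TILE, TILE, PROCS);               // 900 threads per block
    dim3 grid((width + TILE - 1) / TILE,         // round up so partial tiles
              (height + TILE - 1) / TILE, 1);    // at the border are covered
    size_t sharedBytes = TILE * TILE * n * sizeof(float);

    processImage<<<grid, block, sharedBytes>>>(width, height, n, dCount);

    unsigned int hCount = 0;
    cudaMemcpy(&hCount, dCount, sizeof(hCount), cudaMemcpyDeviceToHost);
    printf("grid = %u x %u, in-bounds threads = %u\n", grid.x, grid.y, hCount);

    cudaFree(dCount);
    return 0;
}
```

With rounding up, threads in border blocks can fall outside the image, so the kernel needs a bounds check. Note that the check here comes after `__syncthreads()` and does not return early, which keeps the barrier reachable by the whole block.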

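The problem is that if a pixel has a value of zero, we do not need to process it at all, so I want to exit early for such pixels; otherwise I would do unnecessary work, since a large portion of the image consists of zeros (background). So I thought of writing something like the following: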

//X and Y are the coordinates of the current pixel in the input image.
//threadIdx.z gives the index of the process among the 9 for the current pixel. 

int X=blockIdx.x * blockDim.x + threadIdx.x;
int Y=blockIdx.y * blockDim.y + threadIdx.y;
int numOfProcessForTheCurrPixel=threadIdx.z;
int linearIndexOfPixelInBlock=threadIdx.y * blockDim.x + threadIdx.x;

unsigned short pixelValue=tex2D(image,X,Y);
//Here, threads processing zero-pixels will exit immediately.
if(pixelValue==0)
    return;

float resultArray[22];
//Fill the result array according to our algorithm, mostly irrelevant stuff.
ProcessPixel(resultArray,X,Y,numOfProcessForTheCurrPixel);

for(int i=0;i<22;i++)
    atomicAdd(&__sharedMemoryArray[22*linearIndexOfPixelInBlock + i],resultArray[i]);

__syncthreads();
//Then copy from the shared to the global memory and etc.

What worries me in this situation is what the programming guide says:

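__syncthreads() is allowed in conditional code but only if the conditional evaluates identically across the entire thread block, otherwise the code execution is likely to hang or produce unintended side effects.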

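So in my case, if some pixels in a 10*10 thread block are zero and some are not, the threads belonging to zero pixels will exit right at the start while the other threads continue processing. What happens to synchronization in that case: will it still work properly, or will it produce undefined behavior as the programming guide says? I thought of making the zero-pixel threads process garbage data to keep them busy, but that would needlessly increase processing time when a block consists entirely of zeros (and we have those very often). What can be done in this case?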

score 1 · Accepted Answer

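To avoid creating a deadlock, all threads need to reach __syncthreads() unconditionally. In your example you can do that by replacing the return with an if statement that skips most of the function and heads straight for the __syncthreads() in the zero-pixel case.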

unsigned short pixelValue=tex2D(image,X,Y);
//If there's nothing to compute, jump over all the computation stuff
if(pixelValue!=0)
{

    float resultArray[22];
    //Fill the result array according to our algorithm, mostly irrelevant stuff.
    ProcessPixel(resultArray,X,Y,numOfProcessForTheCurrPixel);

    for(int i=0;i<22;i++)
        atomicAdd(&__sharedMemoryArray[22*linearIndexOfPixelInBlock + i],resultArray[i]);

}

__syncthreads(); 

if (pixelValue != 0)
{
    //Then copy from the shared to the global memory and etc. 
}
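For the all-zero blocks mentioned in the question, one possible refinement is a block-wide vote with `__syncthreads_or()`, which acts as a barrier and returns the same value to every thread, so the whole block can return together without violating the rule about uniform conditions. The sketch below is a self-contained illustration, not the original code: it assumes a plain global-memory image instead of the texture fetch, a hypothetical `processPixel` stand-in that contributes 1.0 to every slot, and an output buffer `out` zeroed by the host. It also clears the shared accumulator before use, which the snippets above do not show.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 10, PROCS = 9, N = 22;

// Hypothetical stand-in for ProcessPixel: every process contributes 1.0 per slot.
__device__ void processPixel(float* result, int /*x*/, int /*y*/, int /*proc*/)
{
    for (int i = 0; i < N; ++i) result[i] = 1.0f;
}

__global__ void accumulate(const unsigned short* image, float* out, int width, int height)
{
    __shared__ float acc[TILE * TILE * N];

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int proc = threadIdx.z;
    int pixelInBlock = threadIdx.y * blockDim.x + threadIdx.x;

    // Out-of-image threads are treated like zero pixels instead of returning.
    unsigned short pixelValue = (x < width && y < height) ? image[y * width + x] : 0;

    // Block-wide vote. Every thread calls it and every thread gets the same
    // answer, so this return is taken by the whole block or by none of it.
    if (!__syncthreads_or(pixelValue != 0))
        return;

    // Clear the accumulator cooperatively before anyone adds to it.
    int tid = (threadIdx.z * blockDim.y + threadIdx.y) * blockDim.x + threadIdx.x;
    int nThreads = blockDim.x * blockDim.y * blockDim.z;
    for (int i = tid; i < TILE * TILE * N; i += nThreads) acc[i] = 0.0f;
    __syncthreads();

    if (pixelValue != 0) {
        float result[N];
        processPixel(result, x, y, proc);
        for (int i = 0; i < N; ++i)
            atomicAdd(&acc[N * pixelInBlock + i], result[i]);
    }
    __syncthreads();  // outside the conditional, reached by all threads

    // One thread per non-zero pixel copies the accumulated array out.
    if (pixelValue != 0 && proc == 0)
        for (int i = 0; i < N; ++i)
            out[(y * width + x) * N + i] = acc[N * pixelInBlock + i];
}

int main()
{
    const int width = 20, height = 20;
    std::vector<unsigned short> hImage(width * height, 0);
    hImage[3 * width + 4] = 7;  // a single non-zero pixel at (x=4, y=3)

    unsigned short* dImage = nullptr;
    float* dOut = nullptr;
    size_t imageBytes = hImage.size() * sizeof(unsigned short);
    size_t outBytes = hImage.size() * N * sizeof(float);
    cudaMalloc(&dImage, imageBytes);
    cudaMalloc(&dOut, outBytes);
    cudaMemcpy(dImage, hImage.data(), imageBytes, cudaMemcpyHostToDevice);
    cudaMemset(dOut, 0, outBytes);  // skipped blocks leave their output at zero

    dim3 block(TILE, TILE, PROCS);
    dim3 grid((width + TILE - 1) / TILE, (height + TILE - 1) / TILE, 1);
    accumulate<<<grid, block>>>(dImage, dOut, width, height);

    std::vector<float> hOut(hImage.size() * N);
    cudaMemcpy(hOut.data(), dOut, outBytes, cudaMemcpyDeviceToHost);
    // Slot 0 of the non-zero pixel, then slot 0 of pixel (0,0).
    printf("%.1f %.1f\n", hOut[(3 * width + 4) * N], hOut[0]);  // prints "9.0 0.0"

    cudaFree(dImage);
    cudaFree(dOut);
    return 0;
}
```

In this example only the block containing pixel (4,3) runs past the vote; the other three blocks return as a whole right after it. Whether the extra barrier pays off depends on how many blocks are entirely background, so it is worth measuring on real images.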
