performance - How to compute a sum in CUDA without using atomics

Question

In the code below, how can I compute the values of sum_array without using atomicAdd?

Kernel

__global__ void calculate_sum( int width,
                               int height,
                               int *pntrs,
                               int2 *sum_array )
{
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;

    if ( row >= height || col >= width ) return;

    int idx = pntrs[ row * width + col ];

    //atomicAdd( &sum_array[ idx ].x, col );

    //atomicAdd( &sum_array[ idx ].y, row );

    sum_array[ idx ].x += col;

    sum_array[ idx ].y += row;
}

Launching the kernel

    dim3 dimBlock( 16, 16 );
    dim3 dimGrid( ( width + ( dimBlock.x - 1 ) ) / dimBlock.x, 
                  ( height + ( dimBlock.y - 1 ) ) / dimBlock.y );

score 1 · Accepted Answer

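Reduction is the general name for this class of problem. Have a look at this presentation for a further explanation, or use Google to find other examples.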

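The general way to approach this problem is to have each thread block compute a parallel sum over one segment of global memory and store its result in global memory. Afterwards, copy the partial results to CPU memory, sum them on the CPU, and copy the final result back to GPU memory. You can avoid the memory copies by running another parallel sum over the partial results.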
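A minimal sketch of that two-step idea, assuming a flat array of ints rather than the int2 array from the question. The names `block_sum`, `sum_on_gpu` and `kThreads` are made up for illustration, and CUDA error checking is left out to keep it short. Each block reduces its segment in shared memory and writes one partial result; the same kernel is then launched again on the partial results until a single value is left, so the partials never have to travel to the CPU.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <utility>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block sums up to kThreads consecutive elements of `in`
// and writes its partial result to out[blockIdx.x].
__global__ void block_sum(const int *in, int *out, int n)
{
    __shared__ int partial[kThreads];

    int tid = threadIdx.x;
    int i   = blockIdx.x * blockDim.x + tid;

    // Out-of-range threads contribute 0 so that every thread reaches __syncthreads().
    partial[tid] = (i < n) ? in[i] : 0;
    __syncthreads();

    // Tree reduction: the number of active threads halves at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride) partial[tid] += partial[tid + stride];
        __syncthreads();
    }

    if (tid == 0) out[blockIdx.x] = partial[0];
}

// Hypothetical host helper: keeps reducing on the device until one value is left.
int sum_on_gpu(const std::vector<int> &host)
{
    int n = static_cast<int>(host.size());
    assert(n > 0);
    int blocks = (n + kThreads - 1) / kThreads;

    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(int));
    cudaMalloc(&d_out, blocks * sizeof(int));
    cudaMemcpy(d_in, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    while (true) {
        blocks = (n + kThreads - 1) / kThreads;
        block_sum<<<blocks, kThreads>>>(d_in, d_out, n);
        if (blocks == 1) break;
        std::swap(d_in, d_out);  // the partial results become the next input
        n = blocks;
    }

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}
```

Only one int is copied back at the end. For small inputs, summing the partial results on the CPU as described above may be just as fast.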

Another approach is to use a highly optimized CUDA library such as Thrust or CUDPP, which already include functions that perform these operations.
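As a rough illustration of the library route, here is a sketch with Thrust that sums one value per element, grouped by label, which is close to what the kernel in the question does for each coordinate. The helper name `sum_by_label` and the choice of `sort_by_key` followed by `reduce_by_key` are assumptions for this example, not something the answer prescribes.

```cuda
#include <thrust/device_vector.h>
#include <thrust/host_vector.h>
#include <thrust/reduce.h>
#include <thrust/sort.h>
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns sums[k] = sum of values[i] over all i with labels[i] == k.
// Labels are assumed to lie in [0, num_labels).
std::vector<int> sum_by_label(const std::vector<int> &labels,
                              const std::vector<int> &values,
                              int num_labels)
{
    assert(labels.size() == values.size());

    thrust::device_vector<int> keys(labels.begin(), labels.end());
    thrust::device_vector<int> vals(values.begin(), values.end());

    // reduce_by_key only merges runs of equal adjacent keys, so sort by label first.
    thrust::sort_by_key(keys.begin(), keys.end(), vals.begin());

    thrust::device_vector<int> out_keys(keys.size());
    thrust::device_vector<int> out_sums(keys.size());
    auto ends = thrust::reduce_by_key(keys.begin(), keys.end(), vals.begin(),
                                      out_keys.begin(), out_sums.begin());
    std::size_t count = ends.first - out_keys.begin();

    // Copy the compacted (label, sum) pairs back and scatter them into a dense array.
    thrust::host_vector<int> h_keys(out_keys.begin(), out_keys.begin() + count);
    thrust::host_vector<int> h_sums(out_sums.begin(), out_sums.begin() + count);

    std::vector<int> sums(num_labels, 0);
    for (std::size_t i = 0; i < count; ++i) sums[h_keys[i]] = h_sums[i];
    return sums;
}
```

For the question's data you would call this once with the column of every pixel as `values` and once with the row. If all you need is a single total, `thrust::reduce` on a `device_vector` is enough.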

score 0 · Accepted Answer

My CUDA is very rusty, but this is roughly how you do it (courtesy of "CUDA by Example", which I strongly suggest you read):

https://developer.nvidia.com/content/cuda-example-introduction-general-purpose-gpu-programming-0

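Partition the array you need to sum more sensibly: CUDA threads are lightweight, but not so lightweight that you can spawn one for just two additions and expect any performance gain in return.
Each thread is then responsible for summing one slice of the data: create a shared int array as large as the number of threads, in which each thread stores the partial sum it computed.
Synchronize the threads and reduce the shared memory array: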

(please treat this as pseudocode)

// Code to sum over a slice: each thread loops over its own subset of the data
// and accumulates into "localsum" (a local variable)
...

// Save the result in shared memory
partial[threadIdx.x] = localsum;

// Synchronize the threads:
__syncthreads();

// From now on partial holds the results of all the threads in the block: you can reduce it.
// We'll do it the naive way, using a single thread (it can easily be parallelized)
if (threadIdx.x == 0) {
    for (int i = 1; i < nthreads; ++i) {
        partial[0] += partial[i];
    }
}

And off you go: partial[0] will hold your sum (or whatever you computed) for that block.
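Filled in, the steps above could look roughly like the sketch below. It assumes a flat int array and a fixed launch configuration. The names `slice_sum`, `sum_slices`, `kSliceThreads` and `kSliceBlocks` are made up, and error checking is omitted. Since each block only produces its own partial[0], the per-block results are written to global memory and added up on the CPU here.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int kSliceThreads = 128;  // assumed threads per block
constexpr int kSliceBlocks  = 32;   // assumed number of blocks

__global__ void slice_sum(const int *data, int n, int *block_sums)
{
    __shared__ int partial[kSliceThreads];

    // Each thread sums its own slice: elements i, i + total threads, i + 2 * total threads, ...
    int localsum = 0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += gridDim.x * blockDim.x) {
        localsum += data[i];
    }

    // Save the result in shared memory and wait for the whole block.
    partial[threadIdx.x] = localsum;
    __syncthreads();

    // Naive final step, as in the pseudocode: one thread adds up the block's partial sums.
    if (threadIdx.x == 0) {
        for (int i = 1; i < blockDim.x; ++i) {
            partial[0] += partial[i];
        }
        block_sums[blockIdx.x] = partial[0];
    }
}

// Hypothetical host helper: launches the kernel and finishes the sum on the CPU.
int sum_slices(const std::vector<int> &host)
{
    int n = static_cast<int>(host.size());
    assert(n > 0);

    int *d_data = nullptr, *d_block_sums = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_block_sums, kSliceBlocks * sizeof(int));
    cudaMemcpy(d_data, host.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    slice_sum<<<kSliceBlocks, kSliceThreads>>>(d_data, n, d_block_sums);

    std::vector<int> block_sums(kSliceBlocks);
    cudaMemcpy(block_sums.data(), d_block_sums, kSliceBlocks * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_block_sums);

    int total = 0;
    for (int s : block_sums) total += s;
    return total;
}
```

The strided loop means the same launch configuration works for any array length; blocks that find no data simply report 0.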

See the dot product example in "CUDA by Example" for a more rigorous discussion of the topic, and for a reduction algorithm that runs in roughly O(log(n)).
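To tie this back to the question, here is one possible way to apply a log-step reduction to sum_array without atomicAdd. This is only a sketch, and the layout is an assumption: one block per label, with every block scanning the whole image. That is simple and race-free but wasteful when there are many labels. The names `calculate_sum_per_label`, `sum_coordinates` and `kLabelThreads` are invented for the example, and the kernel overwrites sum_array[label] with the total instead of adding to an existing value.

```cuda
#include <cuda_runtime.h>
#include <cassert>
#include <vector>

constexpr int kLabelThreads = 256;  // assumed block size; must be a power of two

// Launch with one block per label. Block `label` sums the column and row
// of every pixel whose entry in pntrs equals that label.
__global__ void calculate_sum_per_label(int width, int height,
                                        const int *pntrs, int2 *sum_array)
{
    __shared__ int2 partial[kLabelThreads];

    int label = blockIdx.x;
    int2 local = make_int2(0, 0);
    for (int p = threadIdx.x; p < width * height; p += blockDim.x) {
        if (pntrs[p] == label) {
            local.x += p % width;  // col
            local.y += p / width;  // row
        }
    }
    partial[threadIdx.x] = local;
    __syncthreads();

    // Log-step reduction: the number of active threads halves at each step.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride) {
            partial[threadIdx.x].x += partial[threadIdx.x + stride].x;
            partial[threadIdx.x].y += partial[threadIdx.x + stride].y;
        }
        __syncthreads();
    }

    // Exactly one thread writes each sum_array entry, so no atomics are needed.
    if (threadIdx.x == 0) sum_array[label] = partial[0];
}

// Hypothetical host helper; labels are assumed to lie in [0, num_labels).
std::vector<int2> sum_coordinates(int width, int height,
                                  const std::vector<int> &labels, int num_labels)
{
    assert(num_labels > 0);
    assert(static_cast<int>(labels.size()) == width * height);

    int *d_pntrs = nullptr;
    int2 *d_sums = nullptr;
    cudaMalloc(&d_pntrs, labels.size() * sizeof(int));
    cudaMalloc(&d_sums, num_labels * sizeof(int2));
    cudaMemcpy(d_pntrs, labels.data(), labels.size() * sizeof(int),
               cudaMemcpyHostToDevice);

    calculate_sum_per_label<<<num_labels, kLabelThreads>>>(width, height,
                                                           d_pntrs, d_sums);

    std::vector<int2> sums(num_labels);
    cudaMemcpy(sums.data(), d_sums, num_labels * sizeof(int2),
               cudaMemcpyDeviceToHost);
    cudaFree(d_pntrs);
    cudaFree(d_sums);
    return sums;
}
```

Whether this beats the atomicAdd version depends on the number of labels and the image size, so it is worth timing both on real data.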

Hope this helps.
