I read Mark Harris's article Optimizing Parallel Reduction in CUDA and found it very useful, but there are still one or two concepts I can't follow. On page 18 it says:
//First add during load
// each thread loads one element from global to shared mem
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
sdata[tid] = g_idata[i];
__syncthreads();
Optimized code, with two loads and the first add of the reduction:
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;                              // 1
unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;    // 2
sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];             // 3
__syncthreads();                                             // 4
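I can't understand line 2: if I have 256 elements and choose 128 as my block size, why is the block size multiplied by 2? Please explain how the block size is determined.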
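
To make the indexing in line 2 concrete, here is a minimal sketch of the second snippet wrapped in a complete program. It is not taken from the slides: the names firstIndex, reduceFirstAddDuringLoad, and reduceOnGpu are hypothetical, the in-block loop is a plain sequential-addressing reduction, and the sketch assumes the element count is a multiple of 2 * threads, that threads is a power of two, and that error checking can be skipped for brevity. The point it tries to show is that each thread loads two elements, so one block of blockDim.x threads consumes 2 * blockDim.x inputs and the next block has to start 2 * blockDim.x elements further on.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: global index of the first element loaded by thread
// `tid` of block `block`, when every block of `threads` threads consumes
// 2 * threads input elements (this is the expression from line 2).
__host__ __device__ inline unsigned int firstIndex(unsigned int block,
                                                   unsigned int threads,
                                                   unsigned int tid) {
    return block * (threads * 2) + tid;
}

// Each block sums 2 * blockDim.x consecutive inputs into one partial sum.
__global__ void reduceFirstAddDuringLoad(const float *g_idata, float *g_odata) {
    extern __shared__ float sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = firstIndex(blockIdx.x, blockDim.x, tid);

    // Two loads and the first add: element i plus the element one block
    // width further on, so the block covers [i0, i0 + 2 * blockDim.x).
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // Remaining tree reduction in shared memory (sequential addressing).
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical host wrapper. Assumes n is a multiple of 2 * threads and
// threads is a power of two; CUDA error checking is omitted for brevity.
float reduceOnGpu(const float *h_in, unsigned int n, unsigned int threads) {
    unsigned int blocks = n / (threads * 2);  // half as many blocks as n / threads
    float *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(float));
    cudaMalloc(&d_out, blocks * sizeof(float));
    cudaMemcpy(d_in, h_in, n * sizeof(float), cudaMemcpyHostToDevice);

    reduceFirstAddDuringLoad<<<blocks, threads, threads * sizeof(float)>>>(d_in, d_out);

    // Copy the per-block partial sums back and finish the sum on the host.
    std::vector<float> h_out(blocks);
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    float total = 0.0f;
    for (float partial : h_out) total += partial;
    return total;
}

int main() {
    const unsigned int n = 256, threads = 128;
    std::vector<float> h_in(n, 1.0f);  // 256 ones
    printf("blocks = %u, sum = %.1f\n", n / (threads * 2),
           reduceOnGpu(h_in.data(), n, threads));
    return 0;
}
```

With 256 elements and 128 threads per block this launches a single block: thread 0 adds elements 0 and 128, and thread 127 adds elements 127 and 255. If there were 512 elements, block 1 would start at index 1 * (128 * 2) = 256, which is where the multiplication by 2 comes from. The block size itself stays a free choice here; the sketch only assumes a power of two so that the halving loop works.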