Hopefully you can figure out why it works for the n=2 case, so let's skip that and look at why it fails in other cases; let's pick n=4. With n = 4, you have 4 threads, numbered 0 to 3.
On the first iteration of the for loop, stride = 1, so the threads that pass the if test are threads 0 and 2:
thread 0: dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2: dc[2] += da[2]*db[2] + da[3]*db[3];
So far so good. On the second iteration of the for loop, stride = 2, so the only thread that passes the if test is thread 0:
thread 0: dc[0] += da[0]*db[0] + da[2]*db[2];
But that doesn't make sense and isn't what we want. What we want is something like this:
dc[0] += dc[2];
So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with the code below, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (i.e. n < max thread block size, using a single block):
// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1){
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}
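To connect this back to the n = 4 example: after the multiply step, dc holds the four products. On the first pass stride drops from 4 to 2, and threads 0 and 1 do dc[0] += dc[2] and dc[1] += dc[3]. On the second pass stride drops from 2 to 1 and thread 0 does dc[0] += dc[1], so dc[0] ends up holding the complete dot product. The "handle odd step" line is there so that non-power-of-2 sizes also work: for example with n = 6, stride becomes 3 after the first halving, and on the next trip through the loop thread 0 folds dc[2] into dc[0] before the halving continues.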
Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
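In case it helps, here's a rough sketch of host code that would exercise this kernel with a single block of n threads. The host array names, the test values, and the lack of error checking are just my assumptions for the example, and it assumes the sum_reduce kernel above is defined earlier in the same .cu file:

// host code (minimal sketch; adapt array sizes, values, and error checking as needed)
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 7;                          // any n up to the max threads per block
    float h_a[n], h_b[n];
    for (int i = 0; i < n; i++){ h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, n*sizeof(float));
    cudaMalloc(&db, n*sizeof(float));
    cudaMalloc(&dc, n*sizeof(float));
    cudaMemcpy(da, h_a, n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, h_b, n*sizeof(float), cudaMemcpyHostToDevice);

    // single block of n threads, as described above
    sum_reduce<<<1, n>>>(da, db, dc, n);

    float result;
    cudaMemcpy(&result, dc, sizeof(float), cudaMemcpyDeviceToHost);   // dc[0] holds the dot product
    printf("dot product = %f (expecting %f)\n", result, 2.0f*n);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}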