Hopefully you can figure out why it works for the n=2 case, so let's skip that and look at why it fails in other cases; let's pick n=4. With n = 4, you have 4 threads, numbered 0 to 3.
On the first iteration of the for loop, stride = 1, so the threads that pass the if test are threads 0 and 2:
thread 0: dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2: dc[2] += da[2]*db[2] + da[3]*db[3];
So far so good. On the second iteration of the for loop, stride = 2, so the only thread that passes the if test is thread 0:
thread 0: dc[0] += da[0]*db[0] + da[2]*db[2];
But that doesn't make sense and isn't what we want. What we want is something like this:
dc[0] += dc[2];
So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with the code below, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (i.e. n < max thread block size, using a single block):
// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1){
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}
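To connect this back to the n = 4 example: after the multiply step, dc holds the four products. On the first pass stride drops from 4 to 2, and threads 0 and 1 do dc[0] += dc[2] and dc[1] += dc[3]. On the second pass stride drops from 2 to 1 and thread 0 does dc[0] += dc[1], so dc[0] ends up holding the complete dot product. The "handle odd step" line is there so that non-power-of-2 sizes also work: for example with n = 6, stride becomes 3 after the first halving, and on the next trip through the loop thread 0 folds dc[2] into dc[0] before the halving continues.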
Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
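In case it helps, here's a rough sketch of host code that would exercise this kernel with a single block of n threads. The host array names, the test values, and the lack of error checking are just my assumptions for the example, and it assumes the sum_reduce kernel above is defined earlier in the same .cu file:

// host code (minimal sketch; adapt array sizes, values, and error checking as needed)
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int n = 7;                          // any n up to the max threads per block
    float h_a[n], h_b[n];
    for (int i = 0; i < n; i++){ h_a[i] = 1.0f; h_b[i] = 2.0f; }

    float *da, *db, *dc;
    cudaMalloc(&da, n*sizeof(float));
    cudaMalloc(&db, n*sizeof(float));
    cudaMalloc(&dc, n*sizeof(float));
    cudaMemcpy(da, h_a, n*sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, h_b, n*sizeof(float), cudaMemcpyHostToDevice);

    // single block of n threads, as described above
    sum_reduce<<<1, n>>>(da, db, dc, n);

    float result;
    cudaMemcpy(&result, dc, sizeof(float), cudaMemcpyDeviceToHost);   // dc[0] holds the dot product
    printf("dot product = %f (expecting %f)\n", result, 2.0f*n);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}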