
I've just started with CUDA and I'm trying to wrap my head around CUDA reduction algorithms. In my case, I've been trying to compute the dot product of two matrices, but I only get the right answer for matrices of size 2. For any other size, the result is wrong.

This is just a test, so I'm keeping the matrix sizes very small, only around 100 elements, so a single block can handle everything. Any help would be greatly appreciated. Thanks!

Here is the regular (host) code:

float* ha = new float[n]; // matrix a
float* hb = new float[n]; // matrix b
float* hc = new float[1]; // sum of a.b

float dx = hc[0];
float hx = 0;
// dot product
for (int i = 0; i < n; i++)
    hx += ha[i] * hb[i];

Here is my CUDA kernel:

__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    dc[tid] = 0;
    for (int stride = 1; stride < n; stride *= 2) {
        if (tid % (2 * stride) == 0)
            dc[tid] += (da[tid] * db[tid]) + (da[tid+stride] * db[tid+stride]);
        __syncthreads();
    }
}

My full code: http://pastebin.com/zS85URX5


1 Answer


Hopefully you can figure out why it works for the n = 2 case, so let's skip that and look at why it fails for some other case. Let's pick n = 4. When n = 4, you have 4 threads, numbered 0 to 3.

In the first iteration of your for-loop, stride = 1, so the threads that pass the if test are threads 0 and 2:

thread 0:   dc[0] += da[0]*db[0] + da[1]*db[1];
thread 2:   dc[2] += da[2]*db[2] + da[3]*db[3];

So far so good. In the second iteration of your for-loop, stride is 2, so the only thread that passes the if test is thread 0:

thread 0:   dc[0] += da[0]*db[0] + da[2]*db[2]; 

But this makes no sense and isn't what we want. What we want is something like:

dc[0] += dc[2];
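
To see concretely how far off this goes, here's a small host-side replay of the broken kernel's updates for n = 4 (my own sketch, not code from the question, with hypothetical inputs chosen so every product term is distinct). Even if all threads ran in perfect lockstep, the answer comes out wrong:

// host-side replay of the broken kernel for n = 4
#include <cstdio>

int main()
{
    const int n = 4;
    float da[4] = {1, 2, 3, 4};
    float db[4] = {10, 20, 30, 40};
    float dc[4] = {0, 0, 0, 0};

    // replay the kernel's updates as if all threads ran in perfect lockstep
    for (int stride = 1; stride < n; stride *= 2)
        for (int tid = 0; tid < n; tid++)
            if (tid % (2 * stride) == 0)
                dc[tid] += da[tid] * db[tid] + da[tid + stride] * db[tid + stride];

    // correct dot product for comparison
    float hx = 0.0f;
    for (int i = 0; i < n; i++)
        hx += da[i] * db[i];

    // prints "broken = 150, correct = 300": da[0]*db[0] is counted twice
    // and da[3]*db[3] is never added
    printf("broken = %g, correct = %g\n", dc[0], hx);
    return 0;
}

The first product is counted twice and the last one is never added, so no thread scheduling can rescue this indexing.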

So it's broken. I spent a little while trying to think about how to fix this in just a few steps, but it just doesn't make sense to me as a reduction. If you replace your kernel code with this code, I think you'll have good results. It's not a lot like your code, but it was the closest I could come to something that would work for all the cases you've envisioned (i.e. n < max thread block size, using a single block):

// CUDA kernel code
__global__ void sum_reduce(float* da, float* db, float* dc, int n)
{
    int tid = threadIdx.x;
    // do multiplication in parallel for full width of threads
    dc[tid] = da[tid] * db[tid];
    // wait for all threads to complete multiply step
    __syncthreads();
    int stride = blockDim.x;
    while (stride > 1) {
        // handle odd step
        if ((stride & 1) && (tid == 0)) dc[0] += dc[stride - 1];
        // successively divide problem by 2
        stride >>= 1;
        // add each upper half element to each lower half element
        if (tid < stride) dc[tid] += dc[tid + stride];
        // wait for all threads to complete add step
        __syncthreads();
    }
}

Note that I'm not really using the n parameter. Since you are launching the kernel with n threads, the blockDim.x built-in variable is equal to n in this case.
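
For completeness, here's a minimal host harness showing that kind of launch (my sketch, not code from the question or this answer; it assumes the sum_reduce kernel above is in the same file, and uses the ~100-element, single-block setup the question describes):

// minimal host harness for the kernel above (assumptions noted in the text)
#include <cstdio>
#include <cuda_runtime.h>

__global__ void sum_reduce(float* da, float* db, float* dc, int n); // defined above

int main()
{
    const int n = 100;                 // small enough for a single block
    float ha[n], hb[n], hx = 0.0f, result = 0.0f;
    for (int i = 0; i < n; i++) { ha[i] = 1.0f; hb[i] = 2.0f; }

    // CPU reference dot product
    for (int i = 0; i < n; i++)
        hx += ha[i] * hb[i];

    float *da, *db, *dc;
    cudaMalloc(&da, n * sizeof(float));
    cudaMalloc(&db, n * sizeof(float));
    cudaMalloc(&dc, n * sizeof(float));
    cudaMemcpy(da, ha, n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(db, hb, n * sizeof(float), cudaMemcpyHostToDevice);

    // one block of n threads, so blockDim.x == n inside the kernel
    sum_reduce<<<1, n>>>(da, db, dc, n);

    // the reduced result ends up in dc[0]
    cudaMemcpy(&result, dc, sizeof(float), cudaMemcpyDeviceToHost);
    printf("GPU: %f  CPU: %f\n", result, hx);

    cudaFree(da); cudaFree(db); cudaFree(dc);
    return 0;
}

With ha[i] = 1 and hb[i] = 2, both the GPU and CPU results should print 200.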

Answered 2013-03-30T05:18:59.453