In one of my previous posts I asked how a kernel function could be improved. The kernel computes the squared Euclidean distance between corresponding rows of two equally sized matrices. Eric gave a very good tip: use one thread block per row and then apply a parallel reduction. Before going into further details, I opened this post because I did not want to complicate the previous one, and I give my thanks to Eric. Below is the .cu code, which does not give me the correct results.
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int c = blockDim.x * blockIdx.x + threadIdx.x; // rows
    unsigned int r = blockDim.y * blockIdx.y + threadIdx.y; // cols

    sdata[ tid ] = ( A[ r*cols + c ] - B[ r*cols + c ] ) * ( A[ r*cols + c ] - B[ r*cols + c ] );
    __syncthreads();

    for ( unsigned int s = 1; s < blockDim.x; s *= 2 ) {
        if ( tid % ( 2*s ) == 0 ) {
            sdata[ tid ] += sdata[ tid + s ];
        }
    }
    __syncthreads();

    if ( tid == 0 ) C[ blockIdx.x ] = sdata[ 0 ];
}
The code is based on http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf. It is not the optimized version; I just want to get the basic idea right. I think there is a problem where I initialize sdata. The kernel is launched like this:
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
dim3 dimBlock(1, threadsPerBlock);
dim3 dimGrid(blocksPerGrid, 1);
cudaEuclid<<<dimGrid, dimBlock>>>( d_A, d_B, d_C, rows, cols );
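For reference, here is a minimal sketch of how I currently understand the one-block-per-row reduction should look. The kernel name rowSquaredDistance, the assumption that cols is not larger than threadsPerBlock (and that threadsPerBlock is a power of two), and passing the dynamic shared-memory size as the third launch argument are all my own guesses from the reduction slides, so please correct me if this is also wrong:

__global__ void rowSquaredDistance( const float* A, const float* B, float* C, int rows, int cols )
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int row = blockIdx.x;                // one block handles one row

    // each thread squares the difference for one column of this row (0 if past the end)
    float diff = 0.0f;
    if ( tid < cols ) {
        diff = A[ row*cols + tid ] - B[ row*cols + tid ];
    }
    sdata[ tid ] = diff * diff;
    __syncthreads();

    // tree reduction over the block, synchronizing after every step
    for ( unsigned int s = blockDim.x / 2; s > 0; s >>= 1 ) {
        if ( tid < s ) {
            sdata[ tid ] += sdata[ tid + s ];
        }
        __syncthreads();
    }

    if ( tid == 0 ) C[ row ] = sdata[ 0 ];        // one squared distance per row
}

// launch: one block per row, shared memory sized to the block (my assumption)
int threadsPerBlock = 256;
dim3 dimBlock( threadsPerBlock, 1 );
dim3 dimGrid( rows, 1 );
rowSquaredDistance<<<dimGrid, dimBlock, threadsPerBlock * sizeof(float)>>>( d_A, d_B, d_C, rows, cols );

My thinking is that with one block per row, blockIdx.x directly identifies the row, so the 2D index arithmetic would not be needed, but I am not sure this matches what Eric meant.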
Thank you and sorry for my ignorance.