In one of my previous posts I asked how a kernel function could be improved. The kernel computes the squared Euclidean distance between corresponding rows of two equally sized matrices. Eric gave a very good tip: use one thread block per row and then apply a parallel reduction. Before going into further details, I opened this post because I did not want to complicate the previous one, and I give my thanks to Eric. Below is the .cu code, which does not give me the correct results.
__global__ void cudaEuclid( float* A, float* B, float* C, int rows, int cols )
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int c = blockDim.x * blockIdx.x + threadIdx.x; // rows
    unsigned int r = blockDim.y * blockIdx.y + threadIdx.y; // cols

    sdata[ tid ] = ( A[ r*cols + c ] - B[ r*cols + c ] ) * ( A[ r*cols + c ] - B[ r*cols + c ] );
    __syncthreads();

    for ( unsigned int s = 1; s < blockDim.x; s *= 2 ) {
        if ( tid % ( 2*s ) == 0 ) {
            sdata[ tid ] += sdata[ tid + s ];
        }
    }
    __syncthreads();

    if ( tid == 0 ) C[ blockIdx.x ] = sdata[ 0 ];
}
The code is based on http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf. It is not the optimized version; I just want to get the basic idea right. I think there is a problem where I initialize sdata. The kernel is launched like this:
int threadsPerBlock = 256;
int blocksPerGrid = ceil( (double) numElements / threadsPerBlock);
dim3 dimBlock(1, threadsPerBlock);
dim3 dimGrid(blocksPerGrid, 1);
cudaEuclid<<<dimGrid, dimBlock>>>( d_A, d_B, d_C, rows, cols );
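For reference, here is a minimal sketch of how I currently understand the one-block-per-row reduction should look. The kernel name rowSquaredDistance, the assumption that cols is not larger than threadsPerBlock (and that threadsPerBlock is a power of two), and passing the dynamic shared-memory size as the third launch argument are all my own guesses from the reduction slides, so please correct me if this is also wrong:

__global__ void rowSquaredDistance( const float* A, const float* B, float* C, int rows, int cols )
{
    extern __shared__ float sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int row = blockIdx.x;                // one block handles one row

    // each thread squares the difference for one column of this row (0 if past the end)
    float diff = 0.0f;
    if ( tid < cols ) {
        diff = A[ row*cols + tid ] - B[ row*cols + tid ];
    }
    sdata[ tid ] = diff * diff;
    __syncthreads();

    // tree reduction over the block, synchronizing after every step
    for ( unsigned int s = blockDim.x / 2; s > 0; s >>= 1 ) {
        if ( tid < s ) {
            sdata[ tid ] += sdata[ tid + s ];
        }
        __syncthreads();
    }

    if ( tid == 0 ) C[ row ] = sdata[ 0 ];        // one squared distance per row
}

// launch: one block per row, shared memory sized to the block (my assumption)
int threadsPerBlock = 256;
dim3 dimBlock( threadsPerBlock, 1 );
dim3 dimGrid( rows, 1 );
rowSquaredDistance<<<dimGrid, dimBlock, threadsPerBlock * sizeof(float)>>>( d_A, d_B, d_C, rows, cols );

My thinking is that with one block per row, blockIdx.x directly identifies the row, so the 2D index arithmetic would not be needed, but I am not sure this matches what Eric meant.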
Thank you and sorry for my ignorance.