macos - Different results on every run in CUDA

Question

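I am programming the following in CUDA:

I get a matrix with the image values (d_Data).
I copy the matrix into shared memory (tile).
I compute a basic pixel difference (pixel(i)-pixel(i+1)) (d_diff).
If the difference is a specific value, for example 0, I set a 1 in a matrix (d_A) at every position where a 0 was found in d_diff. This lets me obtain the frequency of 0 in the original difference matrix.
Parallel cumulative sum.
The frequency result goes into the frequency vector.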

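Checking step by step, everything works as expected up to the cumulative sum. When I run the code, the value computed in software is 104347, but from CUDA I sometimes get a nan result and other times an arbitrary number such as 2425. The strange thing is that if I keep running the kernel 20 or 30 times, the value becomes the expected 104347 :S.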
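For cross-checking the GPU result, a host-side reference along the following lines can produce the expected count. This is a minimal sketch, not the software mentioned above: the function name count_zero_diffs_cpu is hypothetical, the image is assumed to be row-major with width dw and height dh, and the last column is assumed to be skipped because it has no right-hand neighbor.

```cuda
#include <cstddef>
#include <vector>

// Hypothetical host-side reference: counts horizontally adjacent pixel
// pairs whose difference is exactly zero.
// Assumptions: row-major layout, last column skipped (no right neighbor).
long count_zero_diffs_cpu(const std::vector<float> &data, int dw, int dh)
{
    long count = 0;
    for (int row = 0; row < dh; ++row) {
        for (int col = 0; col + 1 < dw; ++col) {
            std::size_t idx = static_cast<std::size_t>(row) * dw + col;
            if (data[idx] - data[idx + 1] == 0.0f)
                ++count;
        }
    }
    return count;
}
```

Because the loop is sequential, its result does not depend on execution order, which makes it a stable baseline for comparing against the kernel output.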

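For each matrix I am using:

h_Data = (float *)malloc(data_size);
h_diff = (float *)malloc(data_size);
h_A    = (float *)malloc(data_size);

and

cudaFree(d_A);
cudaFree(d_diff);
cudaFree(d_Av);

So I don't understand why the code gets closer to the correct result when I run it enough times. By the way, once the correct value is reached, it no longer changes no matter how many times I run the code.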

Code:

 __global__ void spam(float *d_Data, float *d_diff, float *d_A, int dw, int dh, float *d_Av){

long bx = blockIdx.x;  long by = blockIdx.y;
long tx = threadIdx.x; long ty = threadIdx.y;


// Identify the row and column of the Pd element to work on
long Row = by * TILE_WIDTH + ty;
long Col = bx * TILE_WIDTH + tx;
long tid = Row*dw+Col;
long i=512*512;
long r = MASK_DIM/2;
long s = 0;

 __shared__ int tile[BLOCK_WIDTH][BLOCK_WIDTH];

for (int k=0; k<=8; k++)
     d_Av[k]=0; 


    if(tid < dw*dh)
    {

   // to shared memory.
                                          tile[ty + r][tx + r]=d_Data[Row*dw+Col];
        if (Col-r >=0)                    tile[ty + r]  [tx] = d_Data[Row*dw+Col-r];
        if (Col+r <dw)                    tile[ty + r]  [tx + 2*r] = d_Data[Row*dw+Col+r];
        if (Row-r >=0)                    tile[ty]      [tx + r] = d_Data[(Row - r)*dw + Col];
        if (Row+r <dw)                    tile[ty + 2*r][tx + r] = d_Data[(Row + r)*dw + Col];
        if (Row - r >= 0 && Col - r >= 0) tile[ty]      [tx] = d_Data[(Row-r)*dw+Col-r];
        if(Row - r >= 0 && Col + r < dw)  tile[ty]      [tx + 2*r] = d_Data[(Row-r)*dw+Col+r];
        if (Row + r < dw && Col - r >= 0) tile[ty + 2*r][tx] = d_Data[(Row+r)*dw+Col-r];
        if(Row + r <dw && Col + r < dw)   tile[ty + 2*r][tx + 2*r] = d_Data[(Row-r)*dw+Col+r]; 

        //Calculates the difference matrix
       d_diff[tid] = (tile[ty + r][tx +r] - tile[ty + r][tx + r + 1]);


        d_A[tid]=0;

       //Set a 1 in each position in d_A where 0 was found in d_diff.
        if (d_diff[tid] == 0)
        { d_A[tid]=1;}
        __syncthreads();

        //cumulative sum to get the frequency of value 0 in d_diff.  // The error is HERE
      for (s = (i/2); s>=1; s=s/2) {
            if (tid < s)
            {   d_A[tid] += d_A[tid+s];
            }
        } 

       // set the frequency value in frequencies vector.
        d_Av[0] = d_A[0];

}} // END IF tid < dw*dh

Any ideas are welcome :D

score 1 · Accepted Answer

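You could try replacing the if statement with the following code: d_A[tid] += d_A[tid+s] * (tid < s);

Also make sure this code does not cause a race condition. That is often the case with parallel sums.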
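To illustrate the race-condition concern: __syncthreads() only synchronizes threads within a single block, so a reduction that reads d_A[tid+s] across the whole grid can see values that other blocks have not yet written, and a loop with no barrier between steps can read partially updated sums. One common way to avoid this is sketched below, as an assumption-laden example rather than a drop-in fix for the kernel in the question: each block reduces its own slice in shared memory with a barrier after every step, then adds its partial count to a global counter with atomicAdd. The names count_zeros, count_zeros_host, and BLOCK_SIZE are hypothetical, the counter is an unsigned integer instead of a float, and error checking is omitted for brevity.

```cuda
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed block size; must be a power of two

// Counts the entries of d_diff that equal zero and adds them to *d_count.
// Must be launched with blockDim.x == BLOCK_SIZE.
__global__ void count_zeros(const float *d_diff, int n, unsigned int *d_count)
{
    __shared__ unsigned int partial[BLOCK_SIZE];

    int tid = threadIdx.x;
    int gid = blockIdx.x * blockDim.x + threadIdx.x;

    // Each thread contributes 1 if its element is zero, otherwise 0.
    // Out-of-range threads contribute 0 but still take part in the barriers.
    partial[tid] = (gid < n && d_diff[gid] == 0.0f) ? 1u : 0u;
    __syncthreads();

    // Tree reduction inside the block. The barrier is outside the if,
    // so every thread in the block reaches it on every iteration.
    for (int s = blockDim.x / 2; s >= 1; s /= 2) {
        if (tid < s)
            partial[tid] += partial[tid + s];
        __syncthreads();
    }

    // One atomic per block combines the per-block results safely.
    if (tid == 0)
        atomicAdd(d_count, partial[0]);
}

// Hypothetical host wrapper: copies the data, launches the kernel,
// and returns the number of zeros.
unsigned int count_zeros_host(const float *h_diff, int n)
{
    float *d_diff = nullptr;
    unsigned int *d_count = nullptr;
    unsigned int h_count = 0;

    cudaMalloc(&d_diff, n * sizeof(float));
    cudaMalloc(&d_count, sizeof(unsigned int));
    cudaMemcpy(d_diff, h_diff, n * sizeof(float), cudaMemcpyHostToDevice);
    // Zero the counter from the host, before launch, instead of letting
    // every thread clear it inside the kernel.
    cudaMemset(d_count, 0, sizeof(unsigned int));

    int blocks = (n + BLOCK_SIZE - 1) / BLOCK_SIZE;
    count_zeros<<<blocks, BLOCK_SIZE>>>(d_diff, n, d_count);

    // This blocking copy waits for the kernel to finish.
    cudaMemcpy(&h_count, d_count, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_diff);
    cudaFree(d_count);
    return h_count;
}
```

Using an integer counter also avoids the rounding that a float accumulator would introduce for large counts, and clearing the result on the host removes the write-write race that arises when every thread resets the same global location.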

MK
