There is a sequential part in my kernel that really slows it down. However, I don't know how to get rid of the inner loop. Any suggestions?

__global__ void myKernel( int keep, int inc, int width, int* d_Xnum,
 int* d_Xco, bool* d_Xvalid,int* d_A )
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  int j = blockIdx.y * blockDim.y + threadIdx.y;

  int k1;

  if( i < keep && j <= i){
    int counter = 0;

    for(k1 = 0; k1 < inc; k1++){
      if(d_Xvalid[j*inc + k1] == 0)
         counter += (d_Xvalid[i*inc + d_Xco[j*width + k1]]);
    }

    d_A[i*keep+j] = inc - d_Xnum[i] - counter;
  }
}

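I believe eliminating the k1 loop would speed up my code considerably. However, I don't know how to do that, since counter depends on it. Any suggestions or ideas are welcome! The kernel is called with: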

         ...
  int t = 32;
  int b = keep/(32)+1;
  int b2 = (inc/32)+1;
  dim3 thread (t, t);
  dim3 block (b, inc);

  // kernel call
  myKernel<<<block, thread>>>(k, inc, width, d_Xnum,
                  d_Xco, d_Xvalid, d_A);
  cudaThreadSynchronize();
            ...

keep is about 9000 and inc is about 20000.


1 Answer


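This is not an exact answer to your question, but it may optimize your code, and it may help you implement a parallel reduction of the sum over k1, because you get rid of the if( i < keep && j <= i). Depending on your GPU model you can also apply other optimizations, such as using textures to access the read-only vectors.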

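Because of the way you generate your indices, many threads sit idle waiting for the others to finish. You are launching at least keep*inc threads, but at most keep*(keep+1)/2 of them actually do any work (because of the condition j <= i).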

I think you can improve this with the following changes:

  1. Launch keep*(keep+1)/2 threads

  2. Change your code as follows

    __global__ void myKernel( int keep, int inc, int width, int* d_Xnum,
    int* d_Xco, bool* d_Xvalid,int* d_A )
    {
      // Flat index over the lower triangle (diagonal included).
      int k = blockIdx.x * blockDim.x + threadIdx.x;
      // Recover row i and column j (0 <= j <= i) from k.
      int i = (int)(sqrt(0.25+2.0*k)-0.5);
      int j = k - i*(i+1)/2;
    
      int k1;
      // Threads with k past the end of the triangle get i >= keep and do nothing.
      if( i < keep && j < inc){
        int counter = 0;
    
        for(k1 = 0; k1 < inc; k1++){
          if(d_Xvalid[j*inc + k1] == 0)
             counter += (d_Xvalid[i*inc + d_Xco[j*width + k1]]);
        }
    
        d_A[i*keep+j] = inc - d_Xnum[i] - counter;
      }
    }
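
As a rough sketch of step 1, the launch size for the flattened kernel could be computed on the host as below. The helper names `triangular_count` and `blocks_needed`, and the block size of 256 in the comment, are assumptions made for illustration; they do not come from the original code.

```cpp
#include <cassert>
#include <cstdint>

// Number of (i, j) pairs with 0 <= j <= i < keep, i.e. keep*(keep+1)/2.
// 64-bit arithmetic keeps the intermediate product from overflowing.
// (Hypothetical helper, not part of the original post.)
std::int64_t triangular_count(std::int64_t keep) {
    assert(keep >= 0);
    return keep * (keep + 1) / 2;
}

// Ceiling division: the smallest number of blocks that covers `total` threads.
// (Hypothetical helper, not part of the original post.)
std::int64_t blocks_needed(std::int64_t total, std::int64_t threads_per_block) {
    assert(total >= 0 && threads_per_block > 0);
    return (total + threads_per_block - 1) / threads_per_block;
}

// Intended use in the .cu file (sketch, assuming 256 threads per block):
//   const int threads = 256;
//   const std::int64_t total  = triangular_count(keep);
//   const std::int64_t blocks = blocks_needed(total, threads);
//   myKernel<<<(unsigned)blocks, threads>>>(keep, inc, width,
//                                           d_Xnum, d_Xco, d_Xvalid, d_A);
```

For keep = 9000 this gives 40,504,500 threads, or 158,221 blocks of 256. That is more than 65,535, the limit on the x-dimension of a grid for devices of compute capability below 3.0, so on such hardware the blocks would have to be spread over a 2D grid and k computed from both block indices. The surplus threads in the last block are harmless, because for them i evaluates to keep and the `i < keep` guard rejects them.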
    

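What you are doing now (for keep = 4, launching 4*4 = 16 threads in the best case; if inc > keep, as seems to be the case, you are launching even more threads) can be pictured like this (each "box" is a thread):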

_________________________________
| i = 0 | i = 0 | i = 0 | i = 0 |
| j = 0 |   -   |   -   |   -   |
_________________________________
| i = 1 | i = 1 | i = 1 | i = 1 |
| j = 0 | j = 1 |   -   |   -   |
_________________________________
| i = 2 | i = 2 | i = 2 | i = 2 |
| j = 0 | j = 1 | j = 2 |   -   |
_________________________________
| i = 3 | i = 3 | i = 3 | i = 3 |
| j = 0 | j = 1 | j = 2 | j = 3 |
_________________________________

What I suggest is that you add an index k and generate i and j from it as needed (for keep = 4, launching 4*(4+1)/2 = 10 threads):

_________________________________________________________________________________
| k = 0 | k = 1 | k = 2 | k = 3 | k = 4 | k = 5 | k = 6 | k = 7 | k = 8 | k = 9 |
| i = 0 | i = 1 | i = 1 | i = 2 | i = 2 | i = 2 | i = 3 | i = 3 | i = 3 | i = 3 |
| j = 0 | j = 0 | j = 1 | j = 0 | j = 1 | j = 2 | j = 0 | j = 1 | j = 2 | j = 3 |
_________________________________________________________________________________

This can be achieved with:

  • i = (int)(sqrt(0.25+2*k)-0.5)

  • j = k - i*(i+1)/2

You can take this as a recipe, or you can look at a little of the math behind it.

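For j = 0 you have k = i*(i+1)/2 (because k = 0+1+2+...+i = i*(i+1)/2). Solving this quadratic, i*i + i - 2*k = 0, for i gives i = (-1 + sqrt(1+8*k))/2 = sqrt(0.25+2*k) - 0.5, which is the formula for i above. The cast to int rounds down, so the same formula also gives the correct i when j != 0. To get j, subtract from k the value it would have if j were 0, namely i*(i+1)/2.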
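
The mapping can be checked on the host before it goes into the kernel. The following is a minimal sketch: the names `Pair`, `unflatten` and `matches_nested_loops` are made up for this example, and the two correction loops are an added safety net against floating-point rounding that the recipe above does not include.

```cpp
#include <cassert>
#include <cmath>

// Hypothetical helper type for this example.
struct Pair { int i; int j; };

// Map a flat index k to (i, j) with 0 <= j <= i, where the rows of the
// lower triangle are stored one after another.
Pair unflatten(long long k) {
    assert(k >= 0);
    int i = (int)(std::sqrt(0.25 + 2.0 * (double)k) - 0.5);
    // Safety net (an assumption of this sketch, not in the original recipe):
    // nudge i if rounding in sqrt put it one row off.
    while ((long long)i * (i + 1) / 2 > k) --i;
    while ((long long)(i + 1) * (i + 2) / 2 <= k) ++i;
    int j = (int)(k - (long long)i * (i + 1) / 2);
    return Pair{i, j};
}

// Returns true if unflatten reproduces the order of the nested loops
//   for (i = 0; i < keep; ++i) for (j = 0; j <= i; ++j)
bool matches_nested_loops(int keep) {
    long long k = 0;
    for (int i = 0; i < keep; ++i) {
        for (int j = 0; j <= i; ++j, ++k) {
            Pair p = unflatten(k);
            if (p.i != i || p.j != j) return false;
        }
    }
    return true;
}
```

Calling `matches_nested_loops(4)` walks the ten (i, j) pairs of the keep = 4 diagram in order of k. To reuse `unflatten` inside the kernel it would presumably need the `__host__ __device__` qualifiers.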

answered 2012-10-03T02:12:42.503