My kernel has a sequential part that really slows it down, but I don't know how to get rid of the inner loop. Any suggestions?
__global__ void myKernel( int keep, int inc, int width, int* d_Xnum,
                          int* d_Xco, bool* d_Xvalid, int* d_A )
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    int j = blockIdx.y * blockDim.y + threadIdx.y;
    int k1;
    if( i < keep && j <= i ){
        int counter = 0;
        for(k1 = 0; k1 < inc; k1++){
            if(d_Xvalid[j*inc + k1] == 0)
                counter += (d_Xvalid[i*inc + d_Xco[j*width + k1]]);
        }
        d_A[i*keep+j] = inc - d_Xnum[i] - counter;
    }
}
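I believe eliminating the k1 loop would speed up my code considerably. However, I don't know how to do that, because counter depends on it. Any suggestions or ideas are welcome! The kernel is called like this: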
...
int t = 32;
int b = keep/(32)+1;
int b2 = (inc/32)+1;
dim3 thread (t, t);
dim3 block (b, inc);
// kernel call
myKernel<<<block, thread>>>(k, inc, width, d_Xnum,
d_Xco, d_Xvalid, d_A);
cudaThreadSynchronize();
...
keep is about 9000 and inc is around 20000.
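
One common way to take the k1 loop out of a single thread is to spread it across the threads of a block and combine the partial counts with a shared-memory reduction. The following is only a minimal sketch of that idea, not a tuned solution. The kernel name myKernelReduce, the THREADS constant, and the tiny test data in main are all hypothetical and not part of the code above. It also assumes width >= inc, since the kernel above reads d_Xco[j*width + k1] for every k1 < inc, and it makes no promise about actual speedup, because the gather through d_Xco stays uncoalesced.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int THREADS = 256;  // assumed block size; must be a power of two

// Hypothetical variant of myKernel: one block per (i, j) pair, with the
// threads of that block sharing the k1 loop.
__global__ void myKernelReduce(int keep, int inc, int width, const int* d_Xnum,
                               const int* d_Xco, const bool* d_Xvalid, int* d_A)
{
    __shared__ int partial[THREADS];
    int i = blockIdx.x;
    int j = blockIdx.y;
    // i and j are the same for every thread in the block, so the whole block
    // returns together and nobody is left waiting at __syncthreads().
    if (i >= keep || j > i) return;

    // Each thread handles k1 = threadIdx.x, threadIdx.x + blockDim.x, ...
    int counter = 0;
    for (int k1 = threadIdx.x; k1 < inc; k1 += blockDim.x) {
        if (!d_Xvalid[j * inc + k1])
            counter += d_Xvalid[i * inc + d_Xco[j * width + k1]];
    }
    partial[threadIdx.x] = counter;
    __syncthreads();

    // Tree reduction of the per-thread counts in shared memory.
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s) partial[threadIdx.x] += partial[threadIdx.x + s];
        __syncthreads();
    }
    if (threadIdx.x == 0)
        d_A[i * keep + j] = inc - d_Xnum[i] - partial[0];
}

int main()
{
    const int keep = 4, inc = 1000, width = inc;  // tiny made-up sizes
    const int n = keep * inc;
    std::vector<int> h_Xnum(keep), h_Xco(keep * width), h_A(keep * keep, 0);
    bool* h_Xvalid = new bool[n];
    for (int r = 0; r < keep; ++r) {              // arbitrary test pattern
        h_Xnum[r] = r;
        for (int k = 0; k < inc; ++k) {
            h_Xvalid[r * inc + k] = ((r + k) % 3 != 0);
            h_Xco[r * width + k] = (7 * k + r) % inc;
        }
    }

    int *d_Xnum, *d_Xco, *d_A;
    bool* d_Xvalid;
    cudaMalloc(&d_Xnum, keep * sizeof(int));
    cudaMalloc(&d_Xco, keep * width * sizeof(int));
    cudaMalloc(&d_Xvalid, n * sizeof(bool));
    cudaMalloc(&d_A, keep * keep * sizeof(int));
    cudaMemcpy(d_Xnum, h_Xnum.data(), keep * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Xco, h_Xco.data(), keep * width * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(d_Xvalid, h_Xvalid, n * sizeof(bool), cudaMemcpyHostToDevice);
    cudaMemset(d_A, 0, keep * keep * sizeof(int));

    dim3 grid(keep, keep);                        // one block per (i, j) pair
    myKernelReduce<<<grid, THREADS>>>(keep, inc, width, d_Xnum, d_Xco, d_Xvalid, d_A);
    // The blocking copy below also waits for the kernel to finish.
    cudaMemcpy(h_A.data(), d_A, keep * keep * sizeof(int), cudaMemcpyDeviceToHost);

    // Compare against the serial loop from the question, run on the host.
    int mismatches = 0;
    for (int i = 0; i < keep; ++i)
        for (int j = 0; j <= i; ++j) {
            int counter = 0;
            for (int k1 = 0; k1 < inc; ++k1)
                if (!h_Xvalid[j * inc + k1])
                    counter += h_Xvalid[i * inc + h_Xco[j * width + k1]];
            if (h_A[i * keep + j] != inc - h_Xnum[i] - counter) ++mismatches;
        }
    printf("mismatches: %d\n", mismatches);

    cudaFree(d_Xnum); cudaFree(d_Xco); cudaFree(d_Xvalid); cudaFree(d_A);
    delete[] h_Xvalid;
    return 0;
}
```

With keep around 9000, a keep x keep grid fits within the grid-dimension limits, but the blocks with j > i exit immediately, so roughly half of the launched blocks do no work. Whether this layout beats the one-thread-per-pair version would have to be measured on the real data.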