I wrote this CUDA kernel for Conway's Game of Life:
__global__ void gameOfLife(float* returnBuffer, int width, int height) {
    unsigned int x = blockIdx.x * blockDim.x + threadIdx.x;
    unsigned int y = blockIdx.y * blockDim.y + threadIdx.y;
    float p = tex2D(inputTex, x, y);    // state of this cell
    float neighbors = 0;                // sum of the 8 surrounding cells
    neighbors += tex2D(inputTex, x + 1, y);
    neighbors += tex2D(inputTex, x - 1, y);
    neighbors += tex2D(inputTex, x, y + 1);
    neighbors += tex2D(inputTex, x, y - 1);
    neighbors += tex2D(inputTex, x + 1, y + 1);
    neighbors += tex2D(inputTex, x - 1, y - 1);
    neighbors += tex2D(inputTex, x - 1, y + 1);
    neighbors += tex2D(inputTex, x + 1, y - 1);
    __syncthreads();
    float final = 0;
    if (neighbors < 2) final = 0;           // underpopulation
    else if (neighbors > 3) final = 0;      // overpopulation
    else if (p != 0) final = 1;             // survival with 2 or 3 neighbors
    else if (neighbors == 3) final = 1;     // birth
    __syncthreads();
    returnBuffer[x + y * width] = final;
}
I'm looking for errors and possible optimizations. Parallel programming is fairly new to me, and I'm not sure whether I'm doing this correctly.
The rest of the code is a memcpy from the input array into the 2D texture inputTex, which is bound to a CUDA array. The output is copied from global memory back to the host and processed there.
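For concreteness, the host-side setup looks roughly like the sketch below. This is not my exact code: it assumes the legacy texture-reference API that a global inputTex used with tex2D implies (deprecated in recent CUDA toolkits), and uploadGrid, hostGrid, and inputArray are placeholder names I made up here.

texture<float, 2, cudaReadModeElementType> inputTex;      // global texture reference

void uploadGrid(const float* hostGrid, int width, int height) {
    cudaChannelFormatDesc desc = cudaCreateChannelDesc<float>();
    cudaArray* inputArray;
    cudaMallocArray(&inputArray, &desc, width, height);
    // Copy the host grid into the CUDA array backing the texture.
    cudaMemcpy2DToArray(inputArray, 0, 0, hostGrid,
                        width * sizeof(float),             // source pitch in bytes
                        width * sizeof(float), height,     // copied region: bytes per row x rows
                        cudaMemcpyHostToDevice);
    inputTex.addressMode[0] = cudaAddressModeClamp;        // out-of-range reads clamp to the border cells
    inputTex.addressMode[1] = cudaAddressModeClamp;
    inputTex.filterMode     = cudaFilterModePoint;         // no interpolation between cells
    cudaBindTextureToArray(inputTex, inputArray, desc);
}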
As you can see, each thread processes a single cell. I'm not sure this is the fastest approach, since some sources suggest having each thread process a whole row or more, while, if I understand correctly, NVidia itself says the more threads the better. I'd appreciate advice on this from someone with practical experience.
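In case it matters for the advice, I launch one thread per cell roughly like this (the 16x16 block size is an arbitrary choice I haven't tuned, and d_returnBuffer is a placeholder name for my device output buffer):

dim3 block(16, 16);                            // arbitrary block size, not tuned
dim3 grid((width  + block.x - 1) / block.x,    // round up so every cell is covered
          (height + block.y - 1) / block.y);
gameOfLife<<<grid, block>>>(d_returnBuffer, width, height);
cudaDeviceSynchronize();                       // wait before copying the result back

Since the grid is rounded up, edge blocks spawn threads past the right and bottom borders of the board, which makes me suspect the kernel needs a bounds check before the write.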