While reading the nbody sample in the CUDA SDK, I went through the code and noticed that it differs a bit from the GPU Gems 3 article "Fast N-Body Simulation with CUDA".
My first question is: why does blockIdx.x still take part in loading data from global memory into shared memory, as written in the code below?
for (int tile = blockIdx.y; tile < numTiles + blockIdx.y; tile++)
{
    // load one tile of body positions from global into shared memory
    sharedPos[threadIdx.x + blockDim.x * threadIdx.y] =
        multithreadBodies ?
        positions[WRAP(blockIdx.x + q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] : // this line
        positions[WRAP(blockIdx.x + tile, gridDim.x) * p + threadIdx.x];                    // this line

    __syncthreads();

    // This is the "tile_calculation" function from the GPUG3 article.
    acc = gravitation(bodyPos, acc);

    __syncthreads();
}
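For reference, this is how I read the WRAP macro used above (my own paraphrase of what I believe the sample defines, a cheap modulo that avoids a divide; please correct me if your copy differs):

// Assumed definition of WRAP: modulo without a divide, valid for 0 <= x < 2*m.
// With it, WRAP(blockIdx.x + tile, gridDim.x) cycles through all tile indices,
// but each block starts the cycle at its own blockIdx.x.
#define WRAP(x, m) (((x) < (m)) ? (x) : ((x) - (m)))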
Shouldn't it instead be the following, as in the paper? I would like to know why.
sharedPos[threadIdx.x + blockDim.x * threadIdx.y] =
    multithreadBodies ?
    positions[WRAP(q * tile + threadIdx.y, gridDim.x) * p + threadIdx.x] :
    positions[WRAP(tile, gridDim.x) * p + threadIdx.x];
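To make the difference between the two indexing schemes concrete, here is a small host-side sketch (with toy numbers I made up, not taken from the SDK) that prints which row of positions each block would read at each loop iteration, for the single-thread-per-body path:

#include <stdio.h>

#define WRAP(x, m) (((x) < (m)) ? (x) : ((x) - (m)))  /* same assumption as above */

int main(void)
{
    const int gridDimX = 4;  /* hypothetical number of blocks in x */
    const int numTiles = 4;  /* hypothetical number of tiles       */

    for (int blockIdxX = 0; blockIdxX < gridDimX; blockIdxX++) {
        printf("block %d:", blockIdxX);
        for (int tile = 0; tile < numTiles; tile++) {
            int sdkRow   = WRAP(blockIdxX + tile, gridDimX); /* SDK indexing        */
            int paperRow = WRAP(tile, gridDimX);             /* indexing I expected */
            printf("  sdk=%d/paper=%d", sdkRow, paperRow);
        }
        printf("\n");
    }
    return 0;
}

If I read this right, the SDK version has every block reading a different row of positions at the same iteration, while the version I expected has all blocks reading the same row at the same time, which is part of why I suspect a memory-access optimization, but I would like confirmation.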
Second, in the multiple-threads-per-body case, why is threadIdx.x still involved? Shouldn't it be a fixed value, or not appear at all, since the sum only runs over threadIdx.y?
if (multithreadBodies)
{
    SX_SUM(threadIdx.x, threadIdx.y).x = acc.x; // this line
    SX_SUM(threadIdx.x, threadIdx.y).y = acc.y; // this line
    SX_SUM(threadIdx.x, threadIdx.y).z = acc.z; // this line

    __syncthreads();

    // Save the result in global memory for the integration step
    if (threadIdx.y == 0)
    {
        for (int i = 1; i < blockDim.y; i++)
        {
            acc.x += SX_SUM(threadIdx.x, i).x; // this line
            acc.y += SX_SUM(threadIdx.x, i).y; // this line
            acc.z += SX_SUM(threadIdx.x, i).z; // this line
        }
    }
}
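For completeness, this is the shared-memory layout I am assuming behind SX_SUM (my assumption from reading the kernel, not copied verbatim from the SDK), where each (threadIdx.x, threadIdx.y) pair stores its own partial acceleration:

// Assumed indexing behind SX_SUM: element (i, j) holds the partial acceleration
// computed by the thread with threadIdx.x == i and threadIdx.y == j, so the loop
// over j at threadIdx.y == 0 accumulates the partial sums belonging to body i.
#define SX_SUM(i, j) sharedPos[(i) + blockDim.x * (j)]

If that layout is right, each value of threadIdx.x seems to correspond to a different body handled by the block, but I would still like someone to confirm that this is why it appears in the reduction.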
Can anyone explain this to me? Is it some kind of optimization to make the code run faster?