nSight 分析器告诉我,以下内核每个线程使用 52 个寄存器:
//Just the first lines of the kernel.
__global__ void voles_kernel(float *params, int *ctrl_params,
float dt, float currTime,
float *dev_voles, float *dev_weasels,
curandStateMtgp32 *state)
{
__shared__ float dev_params[9];
__shared__ int BuYeSimStep[4];
if(threadIdx.x < 4)
{
BuYeSimStep[threadIdx.x] = ctrl_params[threadIdx.x];
}
if(threadIdx.x < 9){
dev_params[threadIdx.x] = params[threadIdx.x];
}
__syncthreads();
float currVole = curand_uniform(&state[blockIdx.x]) + 3.0;
float currWeas = curand_uniform(&state[blockIdx.x]) + 0.1;
float oldVole = currVole;
float oldWeas = currWeas;
int jj;
if (blockIdx.x * blockDim.x + threadIdx.x < BuYeSimStep[2])
{
int dayIndex = 0;
/* Not declaring any new variable from here on, just doing arithmetics.
....... */
如果每个寄存器有 4 个字节,我不明白我们如何得到 52 个寄存器,即使假设数组 params[9] 和 ctrl_params[4] 最终在寄存器中(在这种情况下,像我一样使用共享内存不会使感觉)。我想增加入住率,但我不明白为什么要使用这么多寄存器。有任何想法吗?