While optimizing my kernels, I try to give each thread as many registers as possible to work with at compile time. I have a grid of 1300 points which I can partition arbitrarily into blocks to be processed concurrently. Given that my CUDA device (a GTX 460, compute capability 2.1) provides 32,768 registers per SM, my math skills tell me that two blocks of at most 672 threads each leave
32,768 / 1344 ≈ 24
registers per thread.
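For what it's worth, that figure can be double-checked at runtime. A minimal sketch (assuming device 0, and assuming that on compute capability 2.x the per-block register limit reported by cudaDeviceProp::regsPerBlock equals the per-SM register file size):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);
    int threadsPerSM = 2 * 672; // two resident blocks of 672 threads each
    printf("registers per block/SM : %d\n", prop.regsPerBlock);
    printf("expected cap per thread: %d\n", prop.regsPerBlock / threadsPerSM); // 32768 / 1344 = 24
    return 0;
}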
Compiling my kernel as
__global__ void
__launch_bounds__(672, 2)
moduleB3(...)
yields
ptxas : info : Compiling entry function '_Z8moduleB3PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PiS_S_S_S_S_S_ffffiffffiiffii' for 'sm_20'
ptxas : info : Function properties for _Z8moduleB3PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_PiS_S_S_S_S_S_ffffiffffiiffii
48 bytes stack frame, 84 bytes spill stores, 44 bytes spill loads
ptxas : info : Used 20 registers, 184 bytes cmem[0], 24 bytes cmem[16]
Without __launch_bounds__() the register usage is much higher. I actually have several kernels, and the maximum number of registers used in any one of them is 20, where I suspected 24. Any educated guesses as to where my reasoning goes wrong?
EDIT: The thing is, register usage drops when the launch bounds are specified. Here is the compiler output without launch bounds:
ptxas : info : Compiling entry function '_Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii' for 'sm_21'
ptxas : info : Function properties for _Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas : info : Used 56 registers, 140 bytes cmem[0], 40 bytes cmem[16]
And here with __launch_bounds__(672, 2):
ptxas : info : Compiling entry function '_Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii' for 'sm_21'
ptxas : info : Function properties for _Z10moduleA2_1PfS_S_S_S_S_S_S_S_S_S_S_S_S_S_S_fffffffiiii
120 bytes stack frame, 156 bytes spill stores, 124 bytes spill loads
ptxas : info : Used 20 registers, 140 bytes cmem[0], 40 bytes cmem[16]
As I understand it, the compiler would rather use more registers but cannot because of a resource constraint. However, the registers actually used do not add up to the available 32,768. As stated above, the cap should be 24 registers per thread. I could understand the compiler settling on some smaller number, but none of my kernels, all of which use more registers without launch bounds, requests more than 20 once the bounds are in place.
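To spell the numbers out:

2 × 672 = 1,344 resident threads per SM
24 × 1,344 = 32,256 ≤ 32,768, so 24 registers per thread would still fit
20 × 1,344 = 26,880, leaving 5,888 of the 32,768 registers unused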
I don't think posting the kernels will be of much use, but of course you are welcome to have a look. Here is (hopefully) the simplest one:
__global__ void
__launch_bounds__(672, 2)
moduleA2_1(float *d_t, float *d_x, float *d_p, float *d_rho, float *d_b, float *d_u,
           float *d_ua, float *d_us, float *d_qa, float *d_qs, float *d_dlna,
           float *d_cs, float *d_va, float *d_ma, float *d_uc2, float *d_rhs,
           float k_b, float m_h, float gamma, float PI, float Gmsol, float r_sol,
           float fourpoint_constant, int radius, int numNodes, int numBlocks_A2_1,
           int numGridsPerSM)
{
    int idx, idg, ids;
    //input
    float t, p, rho, b, u, ua, us, qa, qs, dlna;
    //output
    float a2, cs, va, ms, ma, vs12, vs22, uc2, dlna2, rhs;

    // dynamically allocated shared memory, one float per thread
    extern volatile __shared__ float smemA21[];
    float volatile *s_lna2;
    s_lna2 = &smemA21[0];

    ids = blockIdx.x / numBlocks_A2_1;
    idx = (blockIdx.x % numBlocks_A2_1) * (blockDim.x - 2*radius) + threadIdx.x - radius;
    idg = numGridsPerSM * ids;

    // each block works through its share of numGridsPerSM grids in sequence
    while(idg < numGridsPerSM * (ids + 1))
    {
        if(idx >= 0 && idx < numNodes)
        {
            t    = d_t[idg * numNodes + idx];
            p    = d_p[idg * numNodes + idx];
            rho  = d_rho[idg * numNodes + idx];
            b    = d_b[idg * numNodes + idx];
            u    = d_u[idg * numNodes + idx];
            ua   = d_ua[idg * numNodes + idx];
            us   = d_us[idg * numNodes + idx];
            qa   = d_qa[idg * numNodes + idx];
            qs   = d_qs[idg * numNodes + idx];
            dlna = d_dlna[idg * numNodes + idx];
        }

        //computeA2(i); // isothermal sound speed (squared)
        a2 = k_b / m_h * t;
        //computeLna2(i);
        s_lna2[threadIdx.x] = (float)log(a2);
        //computeCs(i); // adiabatic sound speed
        cs = gamma * p / rho;
        d_checkInf(&cs);
        cs = sqrt(cs);
        //computeVa(i); // Alfven speed
        va = b / (float)sqrt(4*PI*1E-7*rho);
        d_checkInf(&va);
        //computeMs(i); // sonic Mach number
        ms = u / cs;
        d_checkInf(&ms);
        if(ms < FLT_MIN)
            ms = FLT_MIN;
        //computeMa(i); // Alfven Mach number
        ma = u / va;
        d_checkInf(&ma);
        if(ma < FLT_MIN)
            ma = FLT_MIN;
        //computeUc2(i); // critical speed (squared)
        uc2 = a2 + ua / (4 * rho) * (1 + 3 * ma)/(1 + ma) + 8 * us / (3 * rho) * (ms)/(1 + ms);
        //computeVs12(i); // support value 1
        vs12 = us / (3 * rho) * (1 - 7 * ms)/(1 + ms);
        //computeVs22(i); // support value 2
        vs22 = 4 * us / (3 * rho) * (ms - 1)/(ms + 1);

        __syncthreads();

        //fourpointLna2(i);
        if((threadIdx.x > radius-1) && (threadIdx.x < blockDim.x - radius) && (idx < numNodes))
        {
            if(idx == 0) // FO-forward difference
                dlna2 = (s_lna2[threadIdx.x+1] - s_lna2[threadIdx.x])
                      / (d_x[idg * numNodes + idx+1] - d_x[idg * numNodes + idx]);
            else if(idx == numNodes - 1) // FO-rearward difference
                dlna2 = (s_lna2[threadIdx.x] - s_lna2[threadIdx.x-1])
                      / (d_x[idg * numNodes + idx] - d_x[idg * numNodes + idx-1]);
            else if(idx == 1 || idx == numNodes - 2) // SO-central difference
                dlna2 = (s_lna2[threadIdx.x+1] - s_lna2[threadIdx.x-1])
                      / (d_x[idg * numNodes + idx+1] - d_x[idg * numNodes + idx-1]);
            else if(idx > 1 && idx < numNodes - 2 && threadIdx.x > 1 && threadIdx.x < blockDim.x - 2)
                dlna2 = fourpoint_constant
                      * ((s_lna2[threadIdx.x+1] - s_lna2[threadIdx.x-1])
                      /  (d_x[idg * numNodes + idx+1] - d_x[idg * numNodes + idx-1]))
                      + (1-fourpoint_constant)
                      * ((s_lna2[threadIdx.x+2] - s_lna2[threadIdx.x-2])
                      /  (d_x[idg * numNodes + idx+2] - d_x[idg * numNodes + idx-2]));
            else
                dlna2 = 0;
        }

        //par_computeRhs();
        if(idx >= 0 && idx < numNodes)
        {
            if(u == 0)
                rhs = - Gmsol / (float)pow(d_x[idg * numNodes + idx] + r_sol, 2)
                      + (uc2 + vs12) * dlna - (a2 + vs22) * dlna2;
            else
                rhs = - Gmsol / (float)pow(d_x[idg * numNodes + idx] + r_sol, 2)
                      + (uc2 + vs12) * dlna - (a2 + vs22) * dlna2
                      + 1 / rho * (qa / (2.0f*(u + va)) + 4.0f * qs / (3.0f*(u + cs)));
        }

        //par_calcSurfaceValues();
        if(threadIdx.x > radius-1 && threadIdx.x < blockDim.x - radius && idx < numNodes)
        {
            d_cs[idg * numNodes + idx]  = cs;
            d_va[idg * numNodes + idx]  = va;
            d_ma[idg * numNodes + idx]  = ma;
            d_uc2[idg * numNodes + idx] = uc2;
            d_rhs[idg * numNodes + idx] = rhs;
        }

        idg++;
    }
}
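For completeness, the launch looks roughly like this (a sketch, not my verbatim host code; numBlocks stands in for my actual grid dimension, and the dynamic shared memory is one float per thread to back s_lna2):

int threadsPerBlock = 672;
size_t smemBytes = threadsPerBlock * sizeof(float); // one float per thread for s_lna2
moduleA2_1<<<numBlocks, threadsPerBlock, smemBytes>>>(
    d_t, d_x, d_p, d_rho, d_b, d_u, d_ua, d_us, d_qa, d_qs, d_dlna,
    d_cs, d_va, d_ma, d_uc2, d_rhs,
    k_b, m_h, gamma, PI, Gmsol, r_sol, fourpoint_constant,
    radius, numNodes, numBlocks_A2_1, numGridsPerSM);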
Thanks for taking the time.