cuda - Which variables consume registers in CUDA?

Question

__global__ void add( int *c, const int* a, const int* b )
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;
    c[offset] = a[offset] + b[offset];
}

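In the example above, I would guess that x, y, and offset are held in registers, while

nvcc -Xptxas -v gives 4 registers, 24+16 bytes smem
the profiler shows 4 registers

and the head of the ptx file reads: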

.reg .u16 %rh<4>;
.reg .u32 %r<9>;    
.reg .u64 %rd<10>;  
.loc    15  21  0   

$LDWbegin__Z3addPiPKiS1_:   
.loc    15  26  0

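Can anyone clarify how registers are used? On Fermi, the maximum number of registers per thread is 63. In my program, I want to test the case where a kernel consumes too many registers (so that variables may have to be placed in local memory automatically, which degrades performance). At that point I could split the kernel into two so that each thread has enough registers, assuming the SM has sufficient resources for concurrent kernels.

I'm not sure whether I have this right.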

score 16 · Accepted Answer

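Register allocation in PTX has no bearing on the kernel's final register consumption. PTX is only an intermediate representation of the final machine code and is in static single assignment form, meaning that each PTX register is assigned exactly once. A piece of PTX with hundreds of registers can compile into a kernel that uses only a few.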

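Register allocation is performed by ptxas as a completely self-contained compilation pass (statically, just-in-time by the driver, or both). It can do a lot of code reordering and optimisation on the input PTX to improve throughput and conserve registers, which means there is little or no relationship between the variables in the original C code or the registers in the PTX and the final register count of the assembled kernel.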
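Since neither the C variables nor the PTX declarations predict the final count, one way to see what ptxas actually settled on is to ask the runtime. The following is a minimal sketch, assuming the kernel from the question; reportAddKernelRegisters is a hypothetical helper name, not part of the CUDA API. It reads the numRegs and localSizeBytes fields that cudaFuncGetAttributes fills in.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Kernel from the question, reproduced so the example is self-contained.
__global__ void add(int *c, const int *a, const int *b)
{
    int x = blockIdx.x;
    int y = blockIdx.y;
    int offset = x + y * gridDim.x;
    c[offset] = a[offset] + b[offset];
}

// Hypothetical helper (not part of the CUDA API): prints the per-thread
// resource usage of the compiled `add` kernel and returns its register
// count, or -1 if the runtime query fails (for example, no usable device).
int reportAddKernelRegisters()
{
    cudaFuncAttributes attr;
    cudaError_t err = cudaFuncGetAttributes(&attr, add);
    if (err != cudaSuccess) {
        std::fprintf(stderr, "cudaFuncGetAttributes failed: %s\n",
                     cudaGetErrorString(err));
        return -1;
    }
    // numRegs is the register count chosen by ptxas, not the PTX count.
    std::printf("registers per thread: %d\n", attr.numRegs);
    // Local memory per thread; a nonzero value may point to register
    // spilling, although other local-memory uses also show up here.
    std::printf("local memory per thread: %zu bytes\n", attr.localSizeBytes);
    return attr.numRegs;
}
```

Calling reportAddKernelRegisters() from host code should report the same kind of figure that nvcc -Xptxas -v prints at build time, so it can serve as a runtime check when testing whether a kernel has started to use local memory.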

nvcc does provide some ways to influence the register allocation behaviour of the assembler. You have __launch_bounds__ to provide heuristic hints to the compiler which can influence register allocation, and the compiler/assembler takes the -maxrregcount argument (at the potential expense of register spilling to local memory, which can lower performance). The volatile keyword used to make a difference to older versions of the nvopen64 based compiler and could influence the local memory spill behaviour. But you can't arbitrarily control or steer register allocation in the original C code or PTX assembly language code.
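To make the two controls mentioned above concrete, here is a minimal sketch. The kernel scaleAdd, the wrapper launchScaleAdd, the bounds 256 and 2, the register cap of 32, and the file name in the comment are all hypothetical values chosen for illustration; only the __launch_bounds__ qualifier and the -maxrregcount option come from the answer.

```cuda
#include <cuda_runtime.h>

// Hypothetical bounds, chosen only for illustration.
#define MAX_THREADS_PER_BLOCK 256
#define MIN_BLOCKS_PER_SM 2

// __launch_bounds__ tells the compiler the largest block size the kernel
// will be launched with and, optionally, the desired number of resident
// blocks per multiprocessor. It is a hint that can change how many
// registers are allocated per thread.
__global__ void __launch_bounds__(MAX_THREADS_PER_BLOCK, MIN_BLOCKS_PER_SM)
scaleAdd(float *out, const float *a, const float *b, float alpha, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        out[i] = alpha * a[i] + b[i];
    }
}

// Hypothetical host wrapper: launches scaleAdd with a block size that
// stays within the bound declared above and returns the launch status.
cudaError_t launchScaleAdd(float *dOut, const float *dA, const float *dB,
                           float alpha, int n)
{
    if (n <= 0) {
        return cudaSuccess;  // nothing to launch
    }
    int blocks = (n + MAX_THREADS_PER_BLOCK - 1) / MAX_THREADS_PER_BLOCK;
    scaleAdd<<<blocks, MAX_THREADS_PER_BLOCK>>>(dOut, dA, dB, alpha, n);
    return cudaGetLastError();
}

// A per-file register cap can be requested at build time instead, e.g.
//   nvcc -Xptxas -v -maxrregcount=32 example.cu
// (32 and example.cu are placeholder values). With -Xptxas -v, ptxas
// prints the resulting register count and any spill loads and stores.
```

Launching with more threads per block than the declared maximum fails, so the bound should match how the kernel is really used. Comparing the -Xptxas -v output with and without these settings is a simple way to see whether they changed the allocation or introduced spills.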
