Is there any benefit to keeping a CUDA kernel's register-per-thread count low?
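I think there is no advantage (in speed or otherwise). Context switching is just as fast with 3 regs/thread as with 48 regs/thread. There is no point in leaving available registers unused unless you simply don't want to use them. Registers are not shared between kernels. Is this wrong?
Edit: From the CUDA 4.2 Programming Guide (5.2.3):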
The number of registers used by a kernel can have a significant impact on the number
of resident warps. For example, for devices of compute capability 1.2, if a kernel uses 16
registers and each block has 512 threads and requires very little shared memory, then two
blocks (i.e. 32 warps) can reside on the multiprocessor since they require 2x512x16
registers, which exactly matches the number of registers available on the multiprocessor.
But as soon as the kernel uses one more register, only one block (i.e. 16 warps) can be
resident since two blocks would require 2x512x17 registers, which are more registers than
are available on the multiprocessor. Therefore, the compiler attempts to minimize register
usage while keeping register spilling (see Section 5.3.2.2) and the number of instructions
to a minimum.
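So it seems the "regs/thread" count on its own matters less than the total register count (regs/thread times resident threads) measured against what the multiprocessor has available.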
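
To check the arithmetic in the quoted passage, here is a minimal sketch in Python. It assumes 16384 registers per multiprocessor for compute capability 1.2, which is the figure implied by the guide's "2x512x16 exactly matches" statement. The function name `resident_blocks` is made up for illustration. It treats registers as the only limit and ignores register allocation granularity, shared memory, and the per-multiprocessor caps on blocks and threads.

```python
# Register-limited residency for the guide's compute capability 1.2 example.
# Assumption: 16384 registers per multiprocessor (implied by 2 x 512 x 16).
REGS_PER_SM = 16384
WARP_SIZE = 32


def resident_blocks(regs_per_thread, threads_per_block, regs_per_sm=REGS_PER_SM):
    """Blocks that fit on one multiprocessor if registers are the only limit."""
    regs_per_block = regs_per_thread * threads_per_block
    return regs_per_sm // regs_per_block


for regs in (16, 17):
    blocks = resident_blocks(regs, 512)
    warps = blocks * 512 // WARP_SIZE
    print(f"{regs} regs/thread -> {blocks} block(s), {warps} resident warps")
```

Under these assumptions, going from 16 to 17 registers per thread halves the number of resident warps (32 to 16), which matches the guide's example.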