
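I am curious why setting an upper limit on register usage (51 in my example) can produce a kernel that uses more registers than when I leave the limit unbounded.

In addition, the version that uses more registers seems to be faster (by about 10 µs out of 700).

Which stages of the optimization process change?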


1 Answer


I cannot provide much insight into the actual CUDA compiler and its stages, but I can offer some common-sense reasoning based on CUDA's execution architecture.

When you do not set a maximum register count, the compiler doesn't know what your target is, so it has to assume that you want to use as few registers as possible, or it employs some other heuristic. In general, minimizing register usage per thread leaves enough registers for more threads on a single multiprocessor. That maximizes utilization, because more thread blocks can be resident on that multiprocessor at once, which is good.

But when you give a maximum register count, the compiler knows that this is your limit and assumes that it can use as many registers as it likes up to that limit. The reason is that the points where register usage becomes too high, so that there are not enough registers for yet another thread block, are actually hard limits. If the registers no longer suffice for another block once a single thread uses 65 registers, then it just doesn't matter whether the thread uses 63 or 64 registers, as long as it doesn't use 65. So the compiler tries to use as many registers as possible (up to the maximum, of course), which is desirable, because registers are the fastest memory type you can get. But this reasoning can only be applied when the compiler knows this hard limit (i.e. you tell it); otherwise it has to employ heuristics, which might not always be optimal.

And the reason why the version with 48 registers is faster than the one with 47 is likely that it, well, uses more registers. If not enough registers are available, data has to be spilled to local memory or copied repeatedly into temporary registers from other registers.

In the end this all makes perfect sense: the more information you give the compiler (by setting your optimal register maximum), the better it can optimize and the more efficient the resulting code should be. Especially with GPU computing, it is usually desirable to tune your kernels to the actual hardware and its resources as closely as possible.

Answered 2013-07-02T11:06:39.097