cuda - Could setting an independent variable early, increase performance?

Question

"Threads don’t stall on memory access"

(from the famous paper http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf by Vasily Volkov)

I am assuming based on this statement that this:

__device__ int a;
int b, c, d;
a = b * c;
// Do some work that is independent of 'a'
// ...
d = a + 1;

is faster than this:

__device__ int a;
int b, c, d;
a = b * c;
d = a + 1;
// Do some work that is independent of 'a'
// ...

I assume this only because the first version gives the thread a chance to execute other instructions while the write to global memory is in flight, whereas the second version does not.

Is my assumption right?

If so, is it good practice to assign every variable that will be used later at the beginning of the kernel, given that the variables are independent of one another and assuming that a is not cached?

score 2 · Accepted Answer

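The stall that the statement really refers to is on memory reads.

It says that issuing a memory read does not create a stall; it is the use of the value that was read, while that value is not yet available, that causes the stall.

Suppose I have: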

__device__ int a[32];

Then this thread code won't cause a stall (although it will generate a memory transaction):

int b = a[0];

But if I then do this, I will get a stall:

int b = a[0];
int c = a[1];
int d = b * c; // stall occurs here

So if I can arrange the code like this instead, the loads have time to complete before their values are needed:

int b = a[0];
int c = a[1];

//  do lots of other work here
int d = b * c; // this might not stall
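
As a rough illustration, the snippet above can be wrapped in a complete program. This is only a sketch: the kernel name loadEarly, the host helper run_load_early, the loop used as "other work", and the test values 6 and 7 are all made up for the example, and the compiler remains free to schedule the instructions differently from the source order.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int a[32];

// Hypothetical kernel: the loads are issued first, unrelated arithmetic
// follows, and the loaded values are consumed only at the end.
__global__ void loadEarly(int *out, int n)
{
    int b = a[0];   // load issued; the thread does not wait here
    int c = a[1];   // second load issued, also without waiting

    // Independent work: uses neither b nor c.
    int acc = 0;
    for (int i = 0; i < n; ++i)
        acc += i * i;

    int d = b * c;  // first use of b and c; waits only if the loads are still in flight
    *out = d + acc;
}

// Hypothetical host helper: fills a[], runs the kernel once, returns its result.
int run_load_early(int n)
{
    int host_a[32] = {0};
    host_a[0] = 6;
    host_a[1] = 7;
    cudaMemcpyToSymbol(a, host_a, sizeof(host_a));

    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));
    loadEarly<<<1, 1>>>(d_out, n);

    int result = -1;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}

int main()
{
    // 6 * 7 + (0 + 1 + 4 + 9) = 56
    printf("%d\n", run_load_early(4));
    return 0;
}
```

The program only demonstrates the source-level ordering and checks the arithmetic; whether a stall is actually avoided depends on the hardware and on the instruction order the compiler emits.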

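On Fermi and Kepler GPUs, writes to global memory (and reads of previously written values, assuming they have not been evicted from the cache) are serviced by the cache. Thread code that appears to write to global memory is therefore usually writing to the L1 or L2 cache; the actual write transaction to global memory happens later and does not necessarily cause any kind of stall.

So in your example, a will normally be serviced from the cache: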

__device__ int a;
int b, c, d;
a = b * c; // a gets written to cache
d = a + 1; // a is serviced from cache

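Note that servicing from the cache is still slower than the fastest access mechanisms, such as registers and shared memory, but it is much faster than a global memory stall.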

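Having said all this, the compiler will often do things that affect the picture. First, instead of you reordering the code by hand, the compiler may discover the independent work and, to some degree, reorder the code for you. Second, in your example the compiler will discover that a is being reused and will very likely assign it to a register variable, in addition to updating the value in global memory at some point. Because it is in a register, the use of a in the last line of the example above will most likely be serviced from the register rather than from global memory or the cache.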
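
One way to see what the compiler does for a particular build is to compile a small version of the question's code and inspect the output, for example the PTX from nvcc -ptx or the machine code from cuobjdump -sass. The sketch below is only a test harness for that purpose; the kernel name writeThenUse, the helper run_write_then_use, and the input values are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__device__ int a;

// Hypothetical kernel mirroring the question: store to a, then use a at once.
__global__ void writeThenUse(int b, int c, int *out)
{
    a = b * c;      // store to global memory
    int d = a + 1;  // a is not volatile, so the compiler may reuse the
                    // product it already holds in a register
    *out = d;
}

// Hypothetical host helper: runs the kernel once and returns d.
int run_write_then_use(int b, int c)
{
    int *d_out = nullptr;
    cudaMalloc(&d_out, sizeof(int));
    writeThenUse<<<1, 1>>>(b, c, d_out);

    int result = -1;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_out);
    return result;
}

int main()
{
    // 3 * 4 + 1 = 13
    printf("%d\n", run_write_then_use(3, 4));
    return 0;
}
```

If the generated code contains a store to a but no load of a before the addition, the value was forwarded through a register as described. The result can differ between compiler versions and optimization levels, so it is worth checking for the exact build in question.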

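So, to answer your question, I would say that in general your assumption is not correct. The compiler will discover the reuse of a and assign it to a register, which eliminates the hazard you think is there. In theory, if there were no caches (true for compute 1.x devices) and no registers, the compiler might be forced to use global memory as you suggest, but in practice this will not happen.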
