
I need some help understanding the behavior of Ron Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2

I don't understand how the use of shared memory can give faster performance than the non-shared-memory version. That is, if I add more index-calculation steps and add another Rd/Wr cycle to access shared memory, how can this be faster than using global memory alone? In either case, the same number of Rd/Wr cycles access global memory. Each kernel instance still touches the data only once. The data is still read in and written out through global memory. The number of kernel instances is the same. The register count looks the same. How can adding more processing steps make it faster? (We haven't removed any processing steps.) Essentially we are doing more work, yet it finishes faster.

Shared memory access is much faster than global memory access, but it isn't zero (or negative). What am I missing?

The "slow" code:

__global__ void reverseArrayBlock(int *d_out, int *d_in) {
int inOffset  = blockDim.x * blockIdx.x;
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int in  = inOffset + threadIdx.x;
int out = outOffset + (blockDim.x - 1 - threadIdx.x);
d_out[out] = d_in[in];
}

The "fast" code:

__global__ void reverseArrayBlock(int *d_out, int *d_in) {
extern __shared__ int s_data[];

int inOffset  = blockDim.x * blockIdx.x;
int in  = inOffset + threadIdx.x;

// Load one element per thread from device memory and store it
// *in reversed order* into temporary shared memory
s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

// Block until all threads in the block have written their data to shared mem
__syncthreads();

// write the data from shared memory in forward order,
// but to the reversed block offset as before
int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
int out = outOffset + threadIdx.x;
d_out[out] = s_data[threadIdx.x];
}

3 Answers


Early CUDA-enabled devices (compute capability < 1.2) would not treat the d_out[out] write in your "slow" version as a coalesced write. Those devices would only coalesce memory accesses in the "nicest" case, where the i-th thread in a half-warp accesses the i-th word. As a result, 16 memory transactions would be issued to service the d_out[out] write for every half-warp, instead of just one memory transaction.

Starting with compute capability 1.2, the rules for memory coalescing in CUDA became much more relaxed. As a result, the d_out[out] write in the "slow" version would also get coalesced, and using shared memory as a scratch pad is no longer necessary.
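To make the access pattern concrete, here is a small host-side sketch (the BLOCK_DIM and GRID_DIM values are made-up for illustration only, not taken from the article) that prints the d_out indices written by the first half-warp of block 0 in each kernel:

#include <stdio.h>

// Hypothetical launch configuration, chosen only for illustration.
#define BLOCK_DIM 256
#define GRID_DIM  4

int main(void) {
    int blockIdx_x = 0;  // look at the first block
    int outOffset = BLOCK_DIM * (GRID_DIM - 1 - blockIdx_x);

    // "Slow" kernel: out = outOffset + (blockDim.x - 1 - threadIdx.x)
    printf("slow kernel, half-warp 0 writes d_out at: ");
    for (int tid = 0; tid < 16; ++tid)
        printf("%d ", outOffset + (BLOCK_DIM - 1 - tid));
    printf("\n");  // 1023 1022 ... 1008: thread i does not hit word i,
                   // so devices with compute capability < 1.2 issue 16 transactions

    // "Fast" kernel: out = outOffset + threadIdx.x
    printf("fast kernel, half-warp 0 writes d_out at: ");
    for (int tid = 0; tid < 16; ++tid)
        printf("%d ", outOffset + tid);
    printf("\n");  // 768 769 ... 783: one coalesced transaction
    return 0;
}

Each of the two patterns stays inside a single aligned 16-word segment, which is why, under the relaxed rules of compute capability >= 1.2, both writes coalesce and the shared-memory staging no longer matters.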

The source of your code sample is the article "CUDA, Supercomputing for the Masses: Part 5", which was written in June 2008. CUDA-enabled devices with compute capability 1.2 only arrived on the market in 2009, so the author of the article was clearly talking about devices with compute capability < 1.2.

For more details, see section F.3.2.1 in the NVIDIA CUDA C Programming Guide.

Answered 2012-08-13T19:37:46.910

This is because shared memory is closer to the compute units, so latency and peak bandwidth are not the bottleneck for this computation (at least in the case of matrix multiplication).

But above all, the most important reason is that many of the numbers in a tile are reused by many threads. So if you access them from global memory, you retrieve those numbers multiple times. Writing them into shared memory once eliminates that wasted bandwidth.
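For reference, here is a minimal sketch of the tiled matrix-multiplication pattern this answer is describing (the kernel name, the TILE size, and the assumption that n is a multiple of TILE are mine, not from the article). Each element loaded into a tile from global memory is reused TILE times out of shared memory:

#define TILE 16

// C = A * B for square n x n matrices; n is assumed to be a multiple of TILE.
__global__ void matMulTiled(const float *A, const float *B, float *C, int n) {
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;
    int col = blockIdx.x * TILE + threadIdx.x;
    float sum = 0.0f;

    for (int t = 0; t < n / TILE; ++t) {
        // Each thread loads one element of A and one element of B into shared memory.
        As[threadIdx.y][threadIdx.x] = A[row * n + t * TILE + threadIdx.x];
        Bs[threadIdx.y][threadIdx.x] = B[(t * TILE + threadIdx.y) * n + col];
        __syncthreads();

        // Each loaded element is reused TILE times by the threads in the block,
        // instead of being re-read from global memory TILE times.
        for (int k = 0; k < TILE; ++k)
            sum += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    C[row * n + col] = sum;
}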

Answered 2012-08-13T18:21:20.127

Looking at the global memory accesses, the slow code reads forward and writes backward. The fast code both reads and writes forward. I think the fast code is faster because the cache hierarchy is somehow optimized for accessing global memory in ascending order (towards higher memory addresses).

CPUs do some speculative fetching, where they will fill cache lines from higher memory addresses before the data has been touched by the program. Maybe something similar happens on the GPU.

Answered 2012-08-13T19:23:03.793