I need some help understanding the behavior of Ron Farber's code: http://www.drdobbs.com/parallel/cuda-supercomputing-for-the-masses-part/208801731?pgno=2
I don't understand how the shared-memory version delivers faster performance than the non-shared-memory version. That is, if I add more index-calculation steps and an extra read/write cycle to go through shared memory, how can that be faster than using global memory alone? In either case the same number of read/write cycles touch global memory. Each kernel instance still accesses each data element only once. The data still enters and leaves through global memory. The number of kernel instances is the same. The register counts look the same. How can adding more processing steps make it faster? (We haven't removed any processing steps.) Essentially we are doing more work, yet it finishes sooner.
Shared-memory access is much faster than global-memory access, but its cost isn't zero (or negative). What am I missing?
The "slow" code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    int inOffset  = blockDim.x * blockIdx.x;
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int in  = inOffset + threadIdx.x;
    int out = outOffset + (blockDim.x - 1 - threadIdx.x);
    d_out[out] = d_in[in];
}
The "fast" code:
__global__ void reverseArrayBlock(int *d_out, int *d_in) {
    extern __shared__ int s_data[];

    int inOffset = blockDim.x * blockIdx.x;
    int in = inOffset + threadIdx.x;

    // Load one element per thread from device memory and store it
    // *in reversed order* into temporary shared memory
    s_data[blockDim.x - 1 - threadIdx.x] = d_in[in];

    // Block until all threads in the block have written their data to shared mem
    __syncthreads();

    // Write the data from shared memory in forward order,
    // but to the reversed block offset as before
    int outOffset = blockDim.x * (gridDim.x - 1 - blockIdx.x);
    int out = outOffset + threadIdx.x;
    d_out[out] = s_data[threadIdx.x];
}