memory - Copying to shared memory in CUDA

Question

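In CUDA programming, if we want to use shared memory, we need to bring the data from global memory into shared memory. Threads are used to transfer that data.

I read somewhere (in an online resource) that it is better not to have all the threads in a block copy data from global memory to shared memory. That idea makes sense, since the threads do not all execute together; only the threads within a warp execute together. But my concern is that the warps are not executed in order. Say a block of threads is divided into 3 warps: warp 0 (threads 0-31), warp 1 (threads 32-63), warp 2 (threads 64-95). There is no guarantee that warp 0 will execute first (am I right?).

So which threads should I use to copy data from global to shared memory?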

score 7 · Accepted Answer

To use a single warp to load a shared memory array, just do something like this:

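__global__
void kernel(float *in_data)
{
    __shared__ float buffer[1024];

    // Only the first warp of the block performs the load.
    if (threadIdx.x < warpSize) {
        // Each lane strides through the array by warpSize, so
        // consecutive lanes read consecutive global addresses.
        for (int i = threadIdx.x; i < 1024; i += warpSize) {
            buffer[i] = in_data[i];
        }
    }
    // Every thread in the block waits here until the load is done.
    __syncthreads();

    // rest of kernel follows
}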

[disclaimer: written in browser, never tested, use at own risk]

The key point here is the use of __syncthreads() to ensure that all threads in the block wait until the warp performing the load to shared memory has finished. The code I posted uses the first warp, but you can calculate a warp number by dividing the thread index within the block by warpSize. I also assumed a one-dimensional block; it is trivial to compute the thread index in a 2D or 3D block, so I leave that as an exercise to the reader.
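For readers who want that exercise spelled out, here is a minimal sketch of the warp-number calculation for a block of any dimensionality. It relies on the standard CUDA rule that the flattened thread index is x + y * blockDim.x + z * blockDim.x * blockDim.y. The names linearThreadId, loadWithOneWarp, loaderWarp, kBufSize, and runOneWarpLoadDemo are made up for this illustration, and the out_data copy-back exists only so the host can check the result. Like the code above, it is untested.

```cuda
#include <cuda_runtime.h>

constexpr int kBufSize = 1024;  // assumed buffer size, matching the example above

// Flattened thread index within the block; valid for 1D, 2D, and 3D blocks.
__device__ unsigned int linearThreadId()
{
    return threadIdx.x
         + threadIdx.y * blockDim.x
         + threadIdx.z * blockDim.x * blockDim.y;
}

// loaderWarp (hypothetical parameter) selects which warp performs the load.
__global__ void loadWithOneWarp(const float *in_data, float *out_data,
                                unsigned int loaderWarp)
{
    __shared__ float buffer[kBufSize];

    unsigned int tid  = linearThreadId();
    unsigned int warp = tid / warpSize;  // warp number within the block
    unsigned int lane = tid % warpSize;  // position within that warp

    if (warp == loaderWarp) {
        for (unsigned int i = lane; i < kBufSize; i += warpSize) {
            buffer[i] = in_data[i];
        }
    }
    __syncthreads();  // all warps wait for the loading warp

    // Demo only: every thread copies part of the buffer back out.
    unsigned int nThreads = blockDim.x * blockDim.y * blockDim.z;
    for (unsigned int i = tid; i < kBufSize; i += nThreads) {
        out_data[i] = buffer[i];
    }
}

// Host-side check: returns the number of mismatches, or -1 on a CUDA error.
int runOneWarpLoadDemo()
{
    static float h_in[kBufSize], h_out[kBufSize];
    for (int i = 0; i < kBufSize; ++i) { h_in[i] = 0.5f * i; h_out[i] = -1.0f; }

    const size_t bytes = kBufSize * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    dim3 block(8, 4, 3);  // 96 threads, i.e. 3 warps; warp 1 does the load
    loadWithOneWarp<<<1, block>>>(d_in, d_out, 1u);
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < kBufSize; ++i) {
        if (h_out[i] != h_in[i]) ++mismatches;
    }
    return mismatches;
}
```

Calling runOneWarpLoadDemo() from main() should return 0 if the load and barrier behave as described. Picking warp 1 instead of warp 0 is deliberate: the result does not depend on which warp the scheduler happens to run first, because the barrier holds the others back.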

score 0 · Accepted Answer

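When a block is assigned to a multiprocessor, it stays resident until all threads in that block have finished, and during that time the warp scheduler switches among the warps whose operands are ready. So if a multiprocessor holds one block with three warps, and only one warp is fetching data from global memory into shared memory while the other two are idle, probably waiting at the __syncthreads() barrier, you lose nothing: you are limited only by global memory latency, which you would be anyway. Once the fetch has finished, the warps can carry on with their work.

So there is no guarantee that warp 0 executes first, and you can use any threads. The only two things to keep in mind are to make global memory accesses as coalesced as possible and to avoid bank conflicts.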
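As a rough illustration of the "any threads" option, the sketch below lets every thread in a one-dimensional block take part in the load. Striding by blockDim.x means consecutive threads read consecutive floats from global memory (a coalesced pattern) and write consecutive 32-bit words of shared memory, which land in distinct banks on current hardware. The names loadWithAllThreads, kTileSize, and runAllThreadsLoadDemo are hypothetical, and the reversed read-back is there only to show that each thread can safely consume elements loaded by other threads once the barrier has been passed. Untested, like the rest of this page.

```cuda
#include <cuda_runtime.h>

constexpr int kTileSize = 1024;  // assumed shared-memory tile size

__global__ void loadWithAllThreads(const float *in_data, float *out_data)
{
    __shared__ float tile[kTileSize];

    // Thread t loads elements t, t + blockDim.x, t + 2 * blockDim.x, ...
    for (int i = threadIdx.x; i < kTileSize; i += blockDim.x) {
        tile[i] = in_data[i];
    }
    __syncthreads();  // the whole tile is visible to every thread after this

    // Demo only: write the tile out reversed, so each thread reads
    // shared-memory elements that were loaded by a different thread.
    for (int i = threadIdx.x; i < kTileSize; i += blockDim.x) {
        out_data[i] = tile[kTileSize - 1 - i];
    }
}

// Host-side check: returns the number of mismatches, or -1 on a CUDA error.
int runAllThreadsLoadDemo()
{
    static float h_in[kTileSize], h_out[kTileSize];
    for (int i = 0; i < kTileSize; ++i) { h_in[i] = 2.0f * i; h_out[i] = -1.0f; }

    const size_t bytes = kTileSize * sizeof(float);
    float *d_in = nullptr, *d_out = nullptr;
    if (cudaMalloc(&d_in, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&d_out, bytes) != cudaSuccess) { cudaFree(d_in); return -1; }
    cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);

    loadWithAllThreads<<<1, 96>>>(d_in, d_out);  // 96 threads = 3 warps
    cudaError_t err = cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int i = 0; i < kTileSize; ++i) {
        if (h_out[i] != h_in[kTileSize - 1 - i]) ++mismatches;
    }
    return mismatches;
}
```

Whether this or a single-warp load is faster depends on the kernel and the hardware; the sketch only shows that correctness comes from the barrier, not from the order in which warps run.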
