To use a single warp to load a shared memory array, just do something like this:
__global__
void kernel(float *in_data)
{
__shared__ float buffer[1024];
if (threadIdx.x < warpSize) {
for(int i = threadIdx; i <1024; i += warpSize) {
buffer[i] = in_data[i];
}
}
__syncthreads();
// rest of kernel follows
}
[disclaimer: written in browser, never tested, use at own risk]
The key point here is the use of __syncthreads()
to ensure that all threads in the block wait until the warp performing the load to shared memory have finished the load. The code I posted used the first warp, but you can calculate a warp number by dividing the thread index within the block by the warpSize. I also assumed a one-dimensional block, it is trivial to compute the thread index in a 2D or 3D block, so I leave that as an exercise to the reader.