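I have the following code, which performs a tiled matrix transpose using shared memory to improve performance. The shared memory tile is padded with one column to avoid bank conflicts for a 32x32 thread block.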
__global__ void transpose_tiled_padded(float *A, float *B, int n)
{
    int i_in = blockDim.x*blockIdx.x + threadIdx.x;
    int j_in = blockDim.y*blockIdx.y + threadIdx.y;
    int i_out = blockDim.x*blockIdx.y + threadIdx.x;
    int j_out = blockDim.y*blockIdx.x + threadIdx.y;
    extern __shared__ float tile[];
    // coalesced read of A rows to (padded) shared tile column (transpose)
    tile[threadIdx.y + threadIdx.x*(blockDim.y+1)] = A[i_in + j_in*n];
    __syncthreads();
    // coalesced write from (padded) shared tile column to B rows
    B[i_out + j_out*n] = tile[threadIdx.x + threadIdx.y*(blockDim.x+1)];
}
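Running this code, I get 100% shared memory efficiency in the NVIDIA Visual Profiler, as I expected. However, when I run it with a 16x16 thread block, I only get 50% efficiency. Why is that? As far as I can tell, no two threads in a warp access the same bank with this layout. Or am I mistaken?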
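One way to check that claim by hand is to enumerate, for each warp, which bank every thread touches using the same index expressions as the kernel. The following is only a host-side sketch, not part of the kernel above: `max_bank_degree` is a hypothetical helper, and it assumes 32 banks of 4-byte words and warps formed from 32 consecutive linear thread IDs (`threadIdx.x + threadIdx.y*blockDim.x`).

```cpp
#include <algorithm>
#include <array>

// Assumptions (not stated in the question): 32 banks, each 4 bytes wide,
// and warps of 32 threads taken in order of the linear thread ID
// threadIdx.x + threadIdx.y * blockDim.x.
constexpr int kBanks = 32;
constexpr int kWarpSize = 32;

// Hypothetical helper: returns the largest number of threads in any single
// warp that map to the same bank (1 means conflict-free).
// store == true  uses the index of the shared-memory store in the kernel,
// store == false uses the index of the shared-memory load.
// For square blocks every thread touches a different word, so threads that
// share a bank cannot be served by a broadcast.
int max_bank_degree(int bdx, int bdy, bool store)
{
    int worst = 0;
    const int threads = bdx * bdy;
    for (int base = 0; base < threads; base += kWarpSize) {
        std::array<int, kBanks> hits{};  // per-bank counters for this warp
        for (int lane = 0; lane < kWarpSize && base + lane < threads; ++lane) {
            const int tx = (base + lane) % bdx;  // threadIdx.x
            const int ty = (base + lane) / bdx;  // threadIdx.y
            const int idx = store ? ty + tx * (bdy + 1)   // tile[...] = A[...]
                                  : tx + ty * (bdx + 1);  // B[...] = tile[...]
            worst = std::max(worst, ++hits[idx % kBanks]);
        }
    }
    return worst;
}
```

Under those assumptions, `max_bank_degree(32, 32, true)` and `max_bank_degree(32, 32, false)` both return 1, while the 16x16 case returns 2 for both the store and the load. With a 16x16 block a warp spans two values of `threadIdx.y`, and with a row stride of 17 the first and last of its 32 tile indices land in the same bank (for example, indices 0 and 32 on the load). A 2-way conflict of this kind would be consistent with the 50% figure.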