
I am running into (what I believe are) shared-memory bank conflicts in a CUDA kernel. The code itself is fairly complex, but I have reproduced the issue in the simple example attached below.

In this case it is reduced to a simple copy from global -> shared -> global memory of a 2D array of size 16x16, using a shared-memory array that may be padded on the right side (variable ng).

If I compile the code with ng=0 and check the shared-memory access pattern with NVVP, it tells me that there are "no issues". With e.g. ng=2, however, I get "Shared Store Transactions/Access = 2, Ideal Transactions/Access = 1" at the lines marked "NVVP warning". I don't understand why (or, more specifically, why the padding causes the warnings).

EDIT: as Greg Smith noted below, on Kepler there are 32 banks with a width of 8 bytes (http://gpgpu.org/wp/wp-content/uploads/2013/09/08-opti-smem-instr.pdf, slide 18). But I don't see how that would change the problem.

If I understand it correctly, with 32 banks (B0, B1, ..) of 4 bytes each, doubles (D00, D01, ..) are stored as:

B0   B1   B2   B3   B4    ..   B31
----------------------------------
D00       D01       D02   ..   D15
D16       D17       D18   ..   D31
D32       D33       D34   ..   D47

Without padding, the first half warp writes (as[ijs] = in[ij]) to shared-memory locations D00..D15 and the second half warp to D16..D31. With the padding (of size 2), the first half warp writes to D00..D15 and the second one, shifted by the padding, to D18..D33, which still should not cause a bank conflict, should it?
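
For reference, here is a small host-only sketch of the mapping I am assuming above (32 banks of 4 bytes; the bank arithmetic is my own illustration, not profiler output):

// Prints, for the first warp (rows 0 and 1), which pair of assumed
// 4-byte banks each double would fall into, with and without padding.
#include <cstdio>

int main()
{
    const int nbanks = 32;   // assumed: 32 banks of 4 bytes each
    const int itot   = 16;   // blockDim.x in the example below
    const int ng     = 2;    // padding in doubles (set to 0 to compare)

    for (int ty = 0; ty < 2; ++ty)           // first two rows = one warp
        for (int tx = 0; tx < itot; ++tx)    // threadIdx.x
        {
            const int ijs  = tx + ty * (itot + ng);   // padded shared-memory index
            const int bank = (2 * ijs) % nbanks;      // bank of the first 4-byte word
            std::printf("row %d, thread %2d -> D%02d in B%02d-B%02d\n",
                        ty, tx, ijs, bank, (bank + 1) % nbanks);
        }
    return 0;
}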

Any idea what could be going wrong here?

Simplified example (tested with CUDA 6.5.14):

// Compiled with nvcc -O3 -arch=sm_35 -lineinfo

#include <cstdlib>

__global__ void copy(double * const __restrict__ out, const double * const __restrict__ in, const int ni, const int nj, const int ng)
{
    extern __shared__ double as[];
    const int ij=threadIdx.x + threadIdx.y*blockDim.x;
    const int ijs=threadIdx.x + threadIdx.y*(blockDim.x+ng);

    as[ijs] = in[ij]; // NVVP warning
    __syncthreads();
    out[ij] = as[ijs]; // NVVP warning
}

int main()
{
    const int itot = 16;
    const int jtot = 16;
    const int ng = 2;
    const int ncells = itot * jtot;

    double *in  = new double[ncells];
    double *out = new double[ncells];
    double *tmp = new double[ncells];
    for(int n=0; n<ncells; ++n)
        in[n]  = 0.001 * (std::rand() % 1000) - 0.5;

    double *ind, *outd;
    cudaMalloc((void **)&ind,  ncells*sizeof(double));
    cudaMalloc((void **)&outd, ncells*sizeof(double));
    cudaMemcpy(ind,  in,  ncells*sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(outd, out, ncells*sizeof(double), cudaMemcpyHostToDevice);

    dim3 gridGPU (1, 1 , 1);
    dim3 blockGPU(16, 16, 1);

    copy<<<gridGPU, blockGPU, (itot+ng)*jtot*sizeof(double)>>>(outd, ind, itot, jtot, ng);

    cudaMemcpy(tmp, outd, ncells*sizeof(double), cudaMemcpyDeviceToHost);

    return 0;
}

1 Answer


It turns out that I did not understand the Kepler architecture correctly. As pointed out by Greg Smith in one of the comments above, Kepler can be configured to have 32 shared-memory banks of 8 bytes each. In that case, using cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte), the shared-memory layout looks like this (a short usage sketch of that call follows the layout):

bank:  B0   B1   B2   B3   B4    ..   B31
       ----------------------------------
index: D00  D01  D02  D03  D04   ..   D31
       D32  D33  D34  D35  D36   ..   D63   
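
For completeness, a minimal sketch of how the 8-byte bank mode can be requested and read back on the host (error checking omitted; whether the request takes effect depends on the architecture):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // Request 8-byte wide shared-memory banks (relevant on Kepler;
    // other architectures may ignore the request).
    cudaDeviceSetSharedMemConfig(cudaSharedMemBankSizeEightByte);

    // Read back the active configuration to verify what the device uses.
    cudaSharedMemConfig cfg;
    cudaDeviceGetSharedMemConfig(&cfg);
    std::printf("8-byte banks active: %s\n",
                cfg == cudaSharedMemBankSizeEightByte ? "yes" : "no");
    return 0;
}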

Now, for my simple example (using itot=16), the writes to and reads from shared memory for e.g. the first two rows (threadIdx.y=0 and threadIdx.y=1) are handled within one warp. This means that for threadIdx.y=0 the values D00..D15 are stored in banks B00..B15; then there is a padding of two doubles, after which, within the same warp, the values D18..D33 are stored in B18..B31 plus B00..B01, which causes a two-way bank conflict on B00 and B01. Without the padding (ng=0), the first row is written to D00..D15 in B00..B15 and the second row to D16..D31 in B16..B31, so no bank conflict occurs.

For a thread block with blockDim.x >= 32 the problem should not occur, because each row is then handled by one or more full warps and every warp addresses all 32 banks exactly once. For example, for itot=32, blockDim.x=32, ng=2, the first row is stored in banks B00..B31; then come two cells of padding, and the second row lands in B02..B31 plus B00..B01, still one access per bank within its warp, and so on.
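
To double-check the reasoning, here is a small host-only sketch that mimics the index arithmetic of the kernel and counts, per warp, how often each 8-byte bank is addressed (illustrative only, not profiler output):

// Counts per-warp accesses to each 8-byte bank for the padded layout,
// mimicking the index arithmetic ijs = threadIdx.x + threadIdx.y*(itot+ng).
#include <cstdio>

static int maxConflict(int itot, int jtot, int ng)
{
    const int nbanks   = 32;   // 32 banks, 8 bytes (one double) each
    const int warpSize = 32;
    int worst = 1;

    for (int warpStart = 0; warpStart < itot * jtot; warpStart += warpSize)
    {
        int count[32] = {0};
        for (int lane = 0; lane < warpSize; ++lane)
        {
            const int tid = warpStart + lane;
            const int tx  = tid % itot;                // threadIdx.x
            const int ty  = tid / itot;                // threadIdx.y
            const int ijs = tx + ty * (itot + ng);     // padded shared-memory index
            ++count[ijs % nbanks];                     // one double per bank
        }
        for (int b = 0; b < nbanks; ++b)
            if (count[b] > worst) worst = count[b];
    }
    return worst;
}

int main()
{
    std::printf("itot=16, ng=0: %d-way\n", maxConflict(16, 16, 0)); // prints 1-way
    std::printf("itot=16, ng=2: %d-way\n", maxConflict(16, 16, 2)); // prints 2-way
    std::printf("itot=32, ng=2: %d-way\n", maxConflict(32, 16, 2)); // prints 1-way
    return 0;
}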

answered 2015-02-09T09:23:08.823