cuda - Uncoalesced float2 CUDA kernel

Question

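I am having trouble optimizing the grid and block sizes for the example below. When I profile it, the memory writes in the kernel do not appear to be coalesced.

I found some solutions on the internet, but they suggest changing the layout of c_image to [x1, x2, x3...] [y1, y2, y3...].

However, I need the layout [x1, y1] [x2, y2]... because I use the CUFFT library elsewhere in my code, and it requires exactly this form.

Is there a way to perform the writes in a coalesced manner while keeping the c_image layout?

My code: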

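#include <cuda_runtime.h>

__global__ void test(float2 *o);  // defined below

int main()
{
    float2 *c_image;  // x1 y1 x2 y2 x3 y3 x4 y4 .. .. .. .. x2048 y2048
    cudaMalloc(&c_image, 2048 * 2048 * sizeof(float2));

    // warp size = 32
    // max thread count = 1024
    dim3 blocksize(1024, 1);
    dim3 gridsize(2048, 2);
    test<<<gridsize, blocksize>>>(c_image);
    cudaDeviceSynchronize();  // wait for the kernel to finish

    cudaFree(c_image);
    return 0;
}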


__global__ void test(float2 *o) 
{
    int x = threadIdx.x + blockIdx.y*1024;
    int y = blockIdx.x;

    int index = x + 2048*y;

    o[index].x = x;
    o[index].y = y;
}

Thanks a lot!

PS: I tried this, but no luck: CUDA float2 coalescing

score 3 · Accepted Answer

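Collapsing the two component writes into a single assignment through a temporary float2 variable should result in one 64-bit write per thread.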

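__global__ void test(float2 *o)
{
    int x = threadIdx.x + blockIdx.y * 1024;
    int y = blockIdx.x;
    int index = x + 2048 * y;
    // float2 has no constructor; make_float2 builds the value from two floats.
    float2 tmp = make_float2(x, y);
    o[index] = tmp;  // one 8-byte store instead of two 4-byte stores
}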
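
To see why the single assignment helps, here is a rough host-side model (plain C++, no GPU needed) that counts how many memory segments one warp touches for each store pattern. It is only a sketch under stated assumptions: the function names are hypothetical, the segment size is a parameter rather than a claim about any particular GPU, the buffer is assumed to be aligned to the segment size, and it assumes the compiler really emits two separate 4-byte stores for the `.x` / `.y` version.

```cpp
#include <cstddef>
#include <set>

// Threads per warp, as stated in the question.
constexpr std::size_t kWarpSize = 32;

// Hypothetical helper: number of distinct memory segments touched when every
// lane of a warp writes `bytes_per_store` bytes at byte address
// lane * stride + offset. `segment_bytes` is the assumed transaction size.
std::size_t segments_touched(std::size_t stride, std::size_t offset,
                             std::size_t bytes_per_store,
                             std::size_t segment_bytes)
{
    std::set<std::size_t> segments;
    for (std::size_t lane = 0; lane < kWarpSize; ++lane) {
        std::size_t first = lane * stride + offset;
        std::size_t last = first + bytes_per_store - 1;
        for (std::size_t s = first / segment_bytes; s <= last / segment_bytes; ++s)
            segments.insert(s);
    }
    return segments.size();
}

// o[index].x = x; o[index].y = y;  -> two store instructions, each writing
// 4 bytes per lane with an 8-byte stride (half of every segment is skipped).
std::size_t split_store_segments(std::size_t segment_bytes)
{
    return segments_touched(8, 0, 4, segment_bytes) +
           segments_touched(8, 4, 4, segment_bytes);
}

// o[index] = tmp;  -> one store instruction writing 8 contiguous bytes per lane.
std::size_t single_store_segments(std::size_t segment_bytes)
{
    return segments_touched(8, 0, 8, segment_bytes);
}
```

In this model, calling `split_store_segments(32)` and `single_store_segments(32)` shows the split version touching every segment twice, once per store instruction and each time filling only half of it, while the single float2 store covers the same 256 bytes in one pass. Real hardware and caches may hide part of this difference, so a profiler run is still the way to confirm it.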

More details are available at
