cuda - 使用 AoS 的内核比使用 SoA 更快

Question

我有两个版本的内核执行相同的任务-填充链接的单元格列表-，两个内核之间的区别是存储粒子位置的数据类型，第一个使用浮点数组来存储位置（由于每个粒子 4 个浮点数到 128 位读/写），第二个使用 vec3f 结构数组来存储位置（一个包含 3 个浮点数的结构）。

使用 nvprof 进行一些测试，我发现第二个内核（使用 vec3f）比第一个内核运行得更快：

 Time(%)      Time   Calls       Avg       Min       Max  Name
   42.88    37.26s       2    18.63s   23.97us    37.26s  adentu_grid_cuda_filling_kernel(int*, int*, int*, float*, int, _vec3f, _vec3f, _vec3i)
   11.00     3.93s       2     1.97s   25.00us     3.93s  adentu_grid_cuda_filling_kernel(int*, int*, int*, _vec3f*, int, _vec3f, _vec3f, _vec3i)

测试完成尝试使用 256 和 512000 个粒子填充链接单元列表。

我的问题是，这里发生了什么？我认为由于合并的内存，浮点数组应该做更好的内存访问，而不是使用具有未对齐内存的 vec3f 结构数组。我误解了什么？

这些是内核，第一个内核：

__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellnAtoms,
                                                 float *pos, 
                                                 int nAtoms, 
                                                 vec3f origin, 
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};

    cell.x =  floor ((_pos.x - origin.x)/h.x);
    cell.y =  floor ((_pos.y - origin.y)/h.y);
    cell.z =  floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
                i = linked[i];
    }
    atomicAdd (&cellnAtoms[c], 1);
}

这是第二个内核：

__global__ void adentu_grid_cuda_filling_kernel (int *head,
                                                 int *linked,
                                                 int *cellNAtoms,
                                                 vec3f *pos,
                                                 int nAtoms,
                                                 vec3f origin,
                                                 vec3f h,
                                                 vec3i nCell)
{
    int idx = threadIdx.x + blockIdx.x * blockDim.x;
    if (idx >= nAtoms)
        return;

    vec3i cell;
    vec3f _pos = pos[idx];

    cell.x = floor ((_pos.x - origin.x)/h.x);
    cell.y = floor ((_pos.y - origin.y)/h.y);
    cell.z = floor ((_pos.z - origin.z)/h.z);

    int c = nCell.x * nCell.y * cell.z + nCell.x * cell.y + cell.x;

    int i;
    if (atomicCAS (&head[c], -1, idx) != -1){
        i = head[c];
        while (atomicCAS (&linked[i], -1, idx) != -1)
                i = linked[i];
    }
    atomicAdd (&cellNAtoms[c], 1);
}

这是 vec3f 结构：

typedef struct _vec3f {float x, y, z} vec3f;

score 5 · Accepted Answer

这不是 AoS 与 SoA 的示例。让我们看看重要的代码行和其中隐含的数据结构。

您的第一个“SoA”或“慢”案例：

vec3f _pos = (vec3f){(float)pos[idx*4+0], (float)pos[idx*4+1], (float)pos[idx*4+2]};
                                      ^                    ^                    ^
                                      |                    |                    |
                               These values are stored in *adjacent* memory locations

因此，一个单独的线程正在连续访问pos[idx*4]以及紧随其后的 2 个位置。这就是存储结构的方式！你所说的数组结构实际上是一个结构数组，它存储在内存中的方式。要获得有效的“SoA”案例，您的代码需要如下所示：

vec3f _pos = (vec3f){(float)pos1[idx], (float)pos2[idx], (float)pos3[idx]};
                                 ^
                                 |
               Adjacent threads will read adjacent values for pos1, pos2, and pos3
                    leading to *coalesced* access.

您的“AoS”或“快速”实际上并没有不同的存储格式。

score 1 · Accepted Answer

在我看来，您的两种方法实际上都是 AoS，唯一的区别是第一种方法是具有四个元素结构的 AoS，而第二种方法仅使用三个元素。这就是为什么您的第二种解决方案更可取的原因。

如果您真的想在您的第一个解决方案中使用 SoA，则必须按如下方式组织 pos 数组：

vec3f _pos = (vec3f){(float)pos[idx], (float)pos[N + idx], (float)pos[2 * N + idx]};

cuda - 使用 AoS 的内核比使用 SoA 更快

2 回答 2

Related

Reference