cuda - How do CUDA's nppiMalloc... functions guarantee alignment?

Question

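Something that has confused me for a while is the alignment requirement of allocated CUDA memory. I know that accessing row elements is more efficient when rows are aligned.

First, a little background:

According to the CUDA C Programming Guide (section 5.3.2):

Global memory resides in device memory, and device memory is accessed via 32-, 64-, or 128-byte memory transactions. These memory transactions must be naturally aligned: only the 32-, 64-, or 128-byte segments of device memory that are aligned to their size (i.e., whose first address is a multiple of their size) can be read or written by memory transactions.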

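My understanding is that for a 2D interleaved array of type T (say, pixel values in R, G, B order), if numChannels * sizeof(T) is 4, 8 or 16, then the array has to be allocated with cudaMallocPitch when performance matters. So far this has worked fine for me: I check numChannels * sizeof(T) before allocating a 2D array, and if it is 4, 8 or 16, I allocate it with cudaMallocPitch and everything works.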
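To make the background concrete, here is a minimal host-side sketch of how rows are addressed in a pitched 2D allocation and why a padded pitch keeps every row start aligned. The helper names (rowOffsetBytes, isAligned, allRowsAligned) and the 128-byte figure used in the note below are illustrative assumptions, not part of CUDA or NPP.

```cpp
#include <cassert>
#include <cstddef>

// Byte offset of row y in a pitched 2D allocation: rows are `pitch` bytes
// apart, regardless of how many bytes of each row hold actual pixel data.
std::size_t rowOffsetBytes(std::size_t y, std::size_t pitch)
{
    return y * pitch;
}

// True if `offset` is a multiple of `alignment` (alignment must be non-zero).
bool isAligned(std::size_t offset, std::size_t alignment)
{
    assert(alignment != 0);
    return offset % alignment == 0;
}

// Hypothetical helper: checks that every row of an image starts on an
// `alignment`-byte boundary, assuming the base address itself is aligned.
bool allRowsAligned(std::size_t height, std::size_t pitch, std::size_t alignment)
{
    for (std::size_t y = 0; y < height; ++y)
        if (!isAligned(rowOffsetBytes(y, pitch), alignment))
            return false;
    return true;
}
```

For example, a tightly packed 500-pixel-wide, 3-channel float image has 500 * 3 * 4 = 6000 bytes per row, so allRowsAligned(4, 6000, 128) is false, while padding each row to a pitch of 6144 bytes makes allRowsAligned(4, 6144, 128) true.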

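Now the question:

I have realized that NVIDIA's NPP library has a family of allocator functions (nppiMalloc..., such as nppiMalloc_32f_C1 and so on). NVIDIA recommends using these functions for performance. My question is: how do these functions guarantee alignment? More specifically, what kind of math do they use to come up with a suitable value for pitch?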

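For a single-channel 512x512 pixel image (float pixel values in the range [0, 1]), I used both cudaMallocPitch and nppiMalloc_32f_C1.
cudaMallocPitch gave me a pitch of 2048, while nppiMalloc_32f_C1 gave me 2560. Where does the latter number come from, and how exactly is it computed?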

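Why I care about this
I'm writing a synchronized memory class template for keeping values in sync between the GPU and the CPU. This class is supposed to take care of allocating pitched memory (where possible) under the hood. Because I want the class to be interoperable with NVIDIA's NPP, I'd like to handle all allocations in a way that gives good performance for both CUDA kernels and NPP operations.
My impression was that nppiMalloc called cudaMallocPitch under the hood, but it seems I was wrong.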

score 3 · Accepted Answer

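An interesting question. However, there may be no definitive answer at all, for several reasons: the implementation of these methods is not publicly available, so one has to assume that NVIDIA uses some special tricks and tweaks internally. Moreover, the resulting pitch is not specified, so one has to assume that it may change between releases of CUDA/NPP. In particular, it is not unlikely that the actual pitch depends on the hardware version (the "Compute Capability") of the device the method is executed on.

Nevertheless, I was curious about this and wrote the following test: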

#include <stdio.h>
#include <npp.h>

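// Allocates images of increasing width with the given NPP allocator and
// prints, for each stride (step in bytes), the largest width that still got it.
template <typename T>
void testStepBytes(const char* name, int elementSize, int numComponents,
    T (*allocator)(int, int, int*))
{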
    printf("%s\n", name);
    int dw = 1;
    int prevStepBytes = 0;
    for (int w=1; w<2050; w+=dw)
    {
        int stepBytes;
        void *p = allocator(w, 1, &stepBytes);
        nppiFree(p);
        if (stepBytes != prevStepBytes)
        {
            printf("Stride %5d is used up to w=%5d (%6d bytes)\n", 
                prevStepBytes, (w-dw), (w-dw)*elementSize*numComponents);
            prevStepBytes = stepBytes;
        }
    }
}

int main(int argc, char *argv[])
{
    testStepBytes("nppiMalloc_8u_C1", 1, 1, &nppiMalloc_8u_C1);
    testStepBytes("nppiMalloc_8u_C2", 1, 2, &nppiMalloc_8u_C2);
    testStepBytes("nppiMalloc_8u_C3", 1, 3, &nppiMalloc_8u_C3);
    testStepBytes("nppiMalloc_8u_C4", 1, 4, &nppiMalloc_8u_C4);

    testStepBytes("nppiMalloc_16u_C1", 2, 1, &nppiMalloc_16u_C1);
    testStepBytes("nppiMalloc_16u_C2", 2, 2, &nppiMalloc_16u_C2);
    testStepBytes("nppiMalloc_16u_C3", 2, 3, &nppiMalloc_16u_C3);
    testStepBytes("nppiMalloc_16u_C4", 2, 4, &nppiMalloc_16u_C4);

    testStepBytes("nppiMalloc_32f_C1", 4, 1, &nppiMalloc_32f_C1);
    testStepBytes("nppiMalloc_32f_C2", 4, 2, &nppiMalloc_32f_C2);
    testStepBytes("nppiMalloc_32f_C3", 4, 3, &nppiMalloc_32f_C3);
    testStepBytes("nppiMalloc_32f_C4", 4, 4, &nppiMalloc_32f_C4);

    return 0;
}

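The pitch (stepBytes) seemed to depend solely on the width of the image. So this program allocates memory for images of different types, with increasing width, and prints the maximum image size that results in each particular stride. The intention was to derive a pattern or rule, namely the "kind of math" that you asked about.

The results are ... somewhat confusing. For example, for the nppiMalloc_32f_C1 call, on my machine (CUDA 6.5, GeForce GTX 560 Ti, Compute Capability 2.1), it prints: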

nppiMalloc_32f_C1
Stride     0 is used up to w=    0 (     0 bytes)
Stride   512 is used up to w=  120 (   480 bytes)
Stride  1024 is used up to w=  248 (   992 bytes)
Stride  1536 is used up to w=  384 (  1536 bytes)
Stride  2048 is used up to w=  504 (  2016 bytes)
Stride  2560 is used up to w=  640 (  2560 bytes)
Stride  3072 is used up to w=  768 (  3072 bytes)
Stride  3584 is used up to w=  896 (  3584 bytes)
Stride  4096 is used up to w= 1016 (  4064 bytes)
Stride  4608 is used up to w= 1152 (  4608 bytes)
Stride  5120 is used up to w= 1280 (  5120 bytes)
Stride  5632 is used up to w= 1408 (  5632 bytes)
Stride  6144 is used up to w= 1536 (  6144 bytes)
Stride  6656 is used up to w= 1664 (  6656 bytes)
Stride  7168 is used up to w= 1792 (  7168 bytes)
Stride  7680 is used up to w= 1920 (  7680 bytes)
Stride  8192 is used up to w= 2040 (  8160 bytes)

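This confirms that for an image with width = 512 it uses a stride of 2560. The expected stride of 2048 is only used for images up to width = 504.

The numbers seemed a bit odd, so I ran another test with nppiMalloc_8u_C1 to cover all possible image row sizes (in bytes), using larger image sizes, and noticed a strange pattern: the first increase of the pitch (from 512 to 1024) occurred when the row was larger than 480 bytes, and 480 = 512 - 32. The next step (from 1024 to 1536) occurred when the row was larger than 992 bytes, and 992 = 480 + 512. The next step (from 1536 to 2048) occurred when the row was larger than 1536 bytes, and 1536 = 992 + 512 + 32. From there on, it mostly proceeds in steps of 512, except for a few sizes in between. The further steps are summarized here: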

nppiMalloc_8u_C1
Stride      0 is used up to w=     0 (     0 bytes, delta     0)
Stride    512 is used up to w=   480 (   480 bytes, delta   480)
Stride   1024 is used up to w=   992 (   992 bytes, delta   512)
Stride   1536 is used up to w=  1536 (  1536 bytes, delta   544)
Stride   2048 is used up to w=  2016 (  2016 bytes, delta   480) \
Stride   2560 is used up to w=  2560 (  2560 bytes, delta   544) | 4
Stride   3072 is used up to w=  3072 (  3072 bytes, delta   512) |
Stride   3584 is used up to w=  3584 (  3584 bytes, delta   512) /
Stride   4096 is used up to w=  4064 (  4064 bytes, delta   480)     \
Stride   4608 is used up to w=  4608 (  4608 bytes, delta   544)     |
Stride   5120 is used up to w=  5120 (  5120 bytes, delta   512)     |
Stride   5632 is used up to w=  5632 (  5632 bytes, delta   512)     | 8
Stride   6144 is used up to w=  6144 (  6144 bytes, delta   512)     |
Stride   6656 is used up to w=  6656 (  6656 bytes, delta   512)     |
Stride   7168 is used up to w=  7168 (  7168 bytes, delta   512)     |
Stride   7680 is used up to w=  7680 (  7680 bytes, delta   512)     /
Stride   8192 is used up to w=  8160 (  8160 bytes, delta   480) \
Stride   8704 is used up to w=  8704 (  8704 bytes, delta   544) |
Stride   9216 is used up to w=  9216 (  9216 bytes, delta   512) |
Stride   9728 is used up to w=  9728 (  9728 bytes, delta   512) |
Stride  10240 is used up to w= 10240 ( 10240 bytes, delta   512) |
Stride  10752 is used up to w= 10752 ( 10752 bytes, delta   512) |
Stride  11264 is used up to w= 11264 ( 11264 bytes, delta   512) |
Stride  11776 is used up to w= 11776 ( 11776 bytes, delta   512) | 16
Stride  12288 is used up to w= 12288 ( 12288 bytes, delta   512) |
Stride  12800 is used up to w= 12800 ( 12800 bytes, delta   512) |
Stride  13312 is used up to w= 13312 ( 13312 bytes, delta   512) |
Stride  13824 is used up to w= 13824 ( 13824 bytes, delta   512) |
Stride  14336 is used up to w= 14336 ( 14336 bytes, delta   512) |
Stride  14848 is used up to w= 14848 ( 14848 bytes, delta   512) |
Stride  15360 is used up to w= 15360 ( 15360 bytes, delta   512) |
Stride  15872 is used up to w= 15872 ( 15872 bytes, delta   512) /
Stride  16384 is used up to w= 16352 ( 16352 bytes, delta   480)     \
Stride  16896 is used up to w= 16896 ( 16896 bytes, delta   544)     |
Stride  17408 is used up to w= 17408 ( 17408 bytes, delta   512)     |
...                                                                ... 32
Stride  31232 is used up to w= 31232 ( 31232 bytes, delta   512)     |
Stride  31744 is used up to w= 31744 ( 31744 bytes, delta   512)     |
Stride  32256 is used up to w= 32256 ( 32256 bytes, delta   512)     /
Stride  32768 is used up to w= 32736 ( 32736 bytes, delta   480) \
Stride  33280 is used up to w= 33280 ( 33280 bytes, delta   544) |
Stride  33792 is used up to w= 33792 ( 33792 bytes, delta   512) |
Stride  34304 is used up to w= 34304 ( 34304 bytes, delta   512) |
...                                                            ... 64
Stride  64512 is used up to w= 64512 ( 64512 bytes, delta   512) |
Stride  65024 is used up to w= 65024 ( 65024 bytes, delta   512) /
Stride  65536 is used up to w= 65504 ( 65504 bytes, delta   480)     \
Stride  66048 is used up to w= 66048 ( 66048 bytes, delta   544)     |   
Stride  66560 is used up to w= 66560 ( 66560 bytes, delta   512)     |
Stride  67072 is used up to w= 67072 ( 67072 bytes, delta   512)     |
....                                                               ... 128
Stride 130048 is used up to w=130048 (130048 bytes, delta   512)     |
Stride 130560 is used up to w=130560 (130560 bytes, delta   512)     /
Stride 131072 is used up to w=131040 (131040 bytes, delta   480) \
Stride 131584 is used up to w=131584 (131584 bytes, delta   544) |
Stride 132096 is used up to w=132096 (132096 bytes, delta   512) |
...                                                              | guess...

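There obviously is a pattern. The pitches are multiples of 512. For sizes of 512*2^n, with n being a whole number, there are some odd -32 and +32 offsets in the size limits that cause a larger pitch to be used.

Maybe I'll have another look at this. I'm pretty sure that someone could derive a formula covering this odd progression. But again: this may depend on the underlying CUDA version, the NPP version, or even the Compute Capability of the card that is used.

(And, just for completeness: this strange pitch size might also simply be a bug in NPP. You never know.)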
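As an illustration only, the progression in the listing above can be reproduced by a simple rule: round the row size up to a multiple of 512, and if that multiple is a power of two and the row ends within the last 32 bytes below it, move on to the next multiple of 512. The sketch below is an empirical fit to that one listing (CUDA 6.5, Compute Capability 2.1); the function name observedNppPitch is hypothetical, and nothing here is NPP's documented or guaranteed behavior.

```cpp
#include <cassert>
#include <cstddef>

// True if x is a power of two (x > 0).
bool isPowerOfTwo(std::size_t x)
{
    return x != 0 && (x & (x - 1)) == 0;
}

// Empirical fit to the listing above; NOT NPP's documented behavior.
// widthBytes = width * elementSize * numComponents, and must be > 0.
std::size_t observedNppPitch(std::size_t widthBytes)
{
    assert(widthBytes > 0);
    const std::size_t step = 512;   // stride granularity seen in the output
    const std::size_t margin = 32;  // the odd -32 offset seen in the output

    // Round the row size up to the next multiple of 512.
    std::size_t pitch = (widthBytes + step - 1) / step * step;

    // When that multiple is a power of two, rows ending within the last
    // 32 bytes below it were observed to get the next larger stride.
    if (isPowerOfTwo(pitch) && widthBytes > pitch - margin)
        pitch += step;
    return pitch;
}
```

Under this fit, a row of 512 floats (2048 bytes) maps to observedNppPitch(2048) == 2560, which is the value reported in the question, while a row of 504 floats (2016 bytes) still maps to 2048.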

score 1 · Accepted Answer

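I thought I'd contribute a listing for a few other allocation types. I am using a GTX 860M with CUDA version 7.5.

cudaMallocPitch aligns to the textureAlignment property, not to texturePitchAlignment as I had suspected. nppiMalloc also aligns to textureAlignment boundaries, but sometimes over-allocates and jumps to the next 512 bytes early.

Because all of these functions align each row to textureAlignment rather than the smaller texturePitchAlignment, more space is used, but a texture should be bindable to any starting row without needing a byte offset in the address calculation. The documentation on textures may be unclear, but it turns out that they need a row pitch that is a multiple of 32 (on this generation of hardware; the texturePitchAlignment property), and the starting address must be a multiple of 128, 256 or 512, depending on the hardware and CUDA version (textureAlignment). Textures may be able to bind to smaller multiples; in my own experience, before I found the correct properties, 256-byte alignment seemed to work fine.

The 512-byte alignment is fairly large, but for both textures and non-textures there may be a performance gain over using the texturePitchAlignment value. I haven't done any testing. For future-proofing I would recommend using cudaMallocPitch or nppiMalloc, but if memory is tight and you are using textures, you could allocate manually with texturePitchAlignment.

If you are using cudaMemcpy2D or similar functions, memory bandwidth over the PCI bus should hold up with the larger pitch. I would recommend using the NVIDIA functions for copying pitched memory over the PCI bus. If they are not already highly optimized and using the DMA controllers, they will eventually be. For smaller pitches it may be more memory-efficient to simply copy everything, padding included, in one bulk transfer over the PCI bus, but that would need testing and possibly CPU-side de-padding on the other end. I wonder whether the NVIDIA functions remove the padding on the GPU before transferring, or do row-by-row DMA transfers? Perhaps eventually, if they don't already.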
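The space trade-off described here can be sketched with plain arithmetic. The helpers below (alignUp and pitchedSizeBytes, both hypothetical names) assume the pitch is simply the row size rounded up to a given alignment, which matches the cudaMallocPitch listing in this answer for textureAlignment = 512 but is not a documented guarantee.

```cpp
#include <cassert>
#include <cstddef>

// Rounds `value` up to the next multiple of `alignment` (alignment > 0).
std::size_t alignUp(std::size_t value, std::size_t alignment)
{
    assert(alignment > 0);
    return (value + alignment - 1) / alignment * alignment;
}

// Total bytes for a pitched image whose rows are padded to `alignment`.
std::size_t pitchedSizeBytes(std::size_t widthBytes, std::size_t height,
                             std::size_t alignment)
{
    return alignUp(widthBytes, alignment) * height;
}
```

With the property values reported in this answer, a 513-byte row gets a pitch of alignUp(513, 512) == 1024 under textureAlignment, but only alignUp(513, 32) == 544 under texturePitchAlignment; for 1000 rows that is 1,024,000 versus 544,000 bytes.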
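For the CPU-side de-padding mentioned above, a minimal sketch could look like the following. It assumes the pitched data has already been copied to host memory in one bulk transfer (the transfer itself is not shown), and the function name removeRowPadding is hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Copies `height` rows of `widthBytes` payload each out of a pitched host
// buffer into a tightly packed one, dropping the padding after every row.
std::vector<unsigned char> removeRowPadding(const unsigned char* src,
                                            std::size_t srcPitch,
                                            std::size_t widthBytes,
                                            std::size_t height)
{
    assert(srcPitch >= widthBytes);
    std::vector<unsigned char> packed(widthBytes * height);
    for (std::size_t y = 0; y < height; ++y)
    {
        // Row y starts at y * srcPitch in the source and at y * widthBytes
        // in the packed destination.
        std::copy_n(src + y * srcPitch, widthBytes,
                    packed.begin() + static_cast<std::ptrdiff_t>(y * widthBytes));
    }
    return packed;
}
```

Whether this extra pass on the CPU is cheaper than letting cudaMemcpy2D handle the pitch is exactly the kind of thing that would need measuring.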

#include <stdio.h>
#include <cuda_runtime.h>
#include <npp.h>

int main(int argc, char **argv)
{
    void *dmem;
    int pitch, pitchOld = 0;
    size_t pitch2;
    int iOld = 0;
    int maxAllocation = 5000;

    cudaDeviceProp prop;

    cudaGetDeviceProperties(&prop, 0);      

    // Both properties are size_t, so print them with %zu rather than %d.
    printf("textureAlignment %zu texturePitchAlignment %zu\n", prop.textureAlignment, prop.texturePitchAlignment);

    printf("%s", "cudaMallocPitch\n");

    for (int i=0;i<maxAllocation;++i) {
        cudaMallocPitch(&dmem, &pitch2, i, 1);

        if (pitch2 != pitchOld && i!= 0) {
            printf("%s%d%s%d%s%d%s", "width ", iOld, "to", i-1, " -> pitch ", pitchOld, "\n");
            pitchOld = pitch2;
            iOld = i;
        }

        cudaFree(dmem);
    }
    pitchOld = 0;

    printf("%s", "nppiMalloc_8u_C1\n");

    for (int i=0;i<maxAllocation/sizeof(Npp8u);++i) {
        dmem = nppiMalloc_8u_C1(i, 1, &pitch);

        if (pitch != pitchOld && i!= 0) {
            printf("%s%d%s%d%s%d%s", "width ", iOld, "to", i-1, " -> pitch ", pitchOld, "\n");
            pitchOld = pitch;
            iOld = i;
        }

        nppiFree(dmem);  // memory from nppiMalloc_* is released with nppiFree
    }
    pitchOld = 0;

    printf("%s", "nppiMalloc_32f_C1\n");

    for (int i=0;i<maxAllocation/sizeof(Npp32f);++i) {
        dmem = nppiMalloc_32f_C1(i, 1, &pitch);

        if (pitch != pitchOld && i!= 0) {
            printf("%s%d%s%d%s%d%s", "width ", iOld, "to", i-1, " -> pitch ", pitchOld, "\n");
            pitchOld = pitch;
            iOld = i;
        }

        nppiFree(dmem);  // memory from nppiMalloc_* is released with nppiFree
    }
    pitchOld = 0;

    return 0;
}

And the output:

textureAlignment 512 texturePitchAlignment 32
cudaMallocPitch
width 0to0 -> pitch 0
width 1to512 -> pitch 512
width 513to1024 -> pitch 1024
width 1025to1536 -> pitch 1536
width 1537to2048 -> pitch 2048
width 2049to2560 -> pitch 2560
width 2561to3072 -> pitch 3072
width 3073to3584 -> pitch 3584
width 3585to4096 -> pitch 4096
width 4097to4608 -> pitch 4608
nppiMalloc_8u_C1
width 0to0 -> pitch 0
width 1to480 -> pitch 512
width 481to992 -> pitch 1024
width 993to1536 -> pitch 1536
width 1537to2016 -> pitch 2048
width 2017to2560 -> pitch 2560
width 2561to3072 -> pitch 3072
width 3073to3584 -> pitch 3584
width 3585to4064 -> pitch 4096
width 4065to4608 -> pitch 4608
nppiMalloc_32f_C1
width 0to0 -> pitch 0
width 1to120 -> pitch 512
width 121to248 -> pitch 1024
width 249to384 -> pitch 1536
width 385to504 -> pitch 2048
width 505to640 -> pitch 2560
width 641to768 -> pitch 3072
width 769to896 -> pitch 3584
width 897to1016 -> pitch 4096
width 1017to1152 -> pitch 4608
