cuda - Correct usage of cudaMalloc3D with cudaMemcpy

Question

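I would like to send a 3D array src, of size size in each dimension and flattened into a 1D array of size length = size * size * size, into a kernel, compute a result, and store it in dst. However, at the end, dst incorrectly contains all 0s. Here is my code: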

int size = 256;
int length = size * size * size;
int bytes = length * sizeof(float);

// Allocate source and destination arrays on the host and initialize source array

float *src, *dst;
cudaMallocHost(&src, bytes);
cudaMallocHost(&dst, bytes);
for (int i = 0; i < length; i++) {
    src[i] = i;
}

// Allocate source and destination arrays on the device

struct cudaPitchedPtr srcGPU, dstGPU;
struct cudaExtent extent = make_cudaExtent(size*sizeof(float), size, size);
cudaMalloc3D(&srcGPU, extent);
cudaMalloc3D(&dstGPU, extent);

// Copy to the device, execute kernel, and copy back to the host

cudaMemcpy(srcGPU.ptr, src, bytes, cudaMemcpyHostToDevice);
myKernel<<<numBlocks, blockSize>>>((float *)srcGPU.ptr, (float *)dstGPU.ptr);
cudaMemcpy(dst, dstGPU.ptr, bytes, cudaMemcpyDeviceToHost);

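For clarity, I have left out the error checking on cudaMallocHost(), cudaMalloc() and cudaMemcpy(). In any case, this code does not trigger any error.

What is the correct usage of cudaMalloc3D() with cudaMemcpy()?

Please let me know if I should also post a minimal test case for the kernel, or if the problem can be found in the code above.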

score 3 · Accepted Answer

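Edit: the extent is given in number of elements if a CUDA array is involved, but is effectively given in bytes if no CUDA array is involved (e.g. memory allocated with some non-array variant of cudaMalloc).

From the CUDA Runtime API documentation:

The extent field defines the dimensions of the transferred area in elements. If a CUDA array is participating in the copy, the extent is defined in terms of that array's elements. If no CUDA array is participating in the copy then the extents are defined in elements of unsigned char.

Furthermore, cudaMalloc3D returns a pitched pointer, which means the allocation has at least the dimensions of the extent you provided, but each row may be padded for alignment reasons. You must take this pitch into account when accessing device memory and when copying to and from it. See the documentation of the cudaPitchedPtr struct.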
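To make the distinction in the quoted passage concrete, here is a minimal sketch of the two ways an extent would be built for a volume of float values. The helper names linearExtent and arrayExtent are hypothetical and not part of the CUDA API; only make_cudaExtent is.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: extent for linear (pitched) memory, e.g. for
// cudaMalloc3D or a cudaMemcpy3D with no CUDA array involved.
// The width is counted in bytes; height and depth are counted in rows/slices.
inline cudaExtent linearExtent(size_t nx, size_t ny, size_t nz) {
    return make_cudaExtent(nx * sizeof(float), ny, nz);
}

// Hypothetical helper: extent when a CUDA array takes part in the copy
// or allocation. The width is counted in array elements.
inline cudaExtent arrayExtent(size_t nx, size_t ny, size_t nz) {
    return make_cudaExtent(nx, ny, nz);
}
```

Under this reading, the extent in the question (size*sizeof(float), size, size) is already the right form for cudaMalloc3D.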
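As an illustration of what taking the pitch into account can look like inside a kernel, the sketch below addresses element (x, y, z) through the pitch and ysize fields of cudaPitchedPtr. The helper rowPtr and the kernel scaleKernel are hypothetical names, and the kernel body (doubling each value) is a stand-in, since the question does not show myKernel.

```cuda
#include <cuda_runtime.h>
#include <cstddef>

// Hypothetical helper: pointer to the first float of row y in slice z.
// Rows are p.pitch bytes apart; slices are p.pitch * p.ysize bytes apart.
__host__ __device__ inline float *rowPtr(cudaPitchedPtr p, size_t y, size_t z) {
    char *base = static_cast<char *>(p.ptr);
    size_t slicePitch = p.pitch * p.ysize;
    return reinterpret_cast<float *>(base + z * slicePitch + y * p.pitch);
}

// Hypothetical kernel (assumption, not the asker's myKernel):
// dst(x, y, z) = 2 * src(x, y, z) for an nx * ny * nz volume.
__global__ void scaleKernel(cudaPitchedPtr src, cudaPitchedPtr dst,
                            int nx, int ny, int nz) {
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;
    int z = blockIdx.z * blockDim.z + threadIdx.z;
    if (x < nx && y < ny && z < nz) {
        rowPtr(dst, y, z)[x] = 2.0f * rowPtr(src, y, z)[x];
    }
}
```

Note that the kernel receives the whole cudaPitchedPtr by value rather than a bare float *, because indexing the pointer as a flat size * size * size array is only valid when the pitch happens to equal size * sizeof(float).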

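As for using cudaMalloc3D with cudaMemcpy, you might want to look at cudaMemcpy3D instead (see its documentation). It takes the pitch of both the host and the device memory into account, which may make your life a bit easier. To use cudaMemcpy3D, you have to create a cudaMemcpy3DParms struct filled in with the appropriate information. Its members are: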

cudaArray_t dstArray
struct cudaPos dstPos
struct cudaPitchedPtr dstPtr
struct cudaExtent extent
enum cudaMemcpyKind kind
cudaArray_t srcArray
struct cudaPos srcPos
struct cudaPitchedPtr srcPtr

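You must specify exactly one of srcArray or srcPtr, and exactly one of dstArray or dstPtr. The documentation also recommends initializing the struct to 0 before using it, e.g. cudaMemcpy3DParms myParms = {0};

You may also be interested in this other SO question.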
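Putting the members above together, a host-to-device and device-to-host copy for memory from cudaMalloc3D might look like the following sketch. The functions copyVolume and roundTrip are hypothetical names introduced for illustration; the sketch assumes a tightly packed host buffer of float values and handles only the two copy directions from the question.

```cuda
#include <cuda_runtime.h>
#include <cstddef>
#include <vector>

// Hypothetical helper: copy an nx * ny * nz float volume between a tightly
// packed host buffer and a pitched device allocation from cudaMalloc3D.
inline cudaError_t copyVolume(float *host, cudaPitchedPtr dev,
                              size_t nx, size_t ny, size_t nz,
                              cudaMemcpyKind kind) {
    // The host buffer has no padding, so its pitch is exactly one row in bytes.
    cudaPitchedPtr hostPtr =
        make_cudaPitchedPtr(host, nx * sizeof(float), nx * sizeof(float), ny);

    cudaMemcpy3DParms parms = {0};  // zero-initialize, as the docs recommend
    parms.extent = make_cudaExtent(nx * sizeof(float), ny, nz);  // width in bytes
    parms.kind = kind;
    if (kind == cudaMemcpyHostToDevice) {
        parms.srcPtr = hostPtr;
        parms.dstPtr = dev;
    } else if (kind == cudaMemcpyDeviceToHost) {
        parms.srcPtr = dev;
        parms.dstPtr = hostPtr;
    } else {
        return cudaErrorInvalidValue;  // other directions are out of scope here
    }
    return cudaMemcpy3D(&parms);
}

// Hypothetical check: upload an n^3 volume and read it back unchanged.
inline bool roundTrip(size_t n) {
    size_t count = n * n * n;
    std::vector<float> src(count), dst(count, 0.0f);
    for (size_t i = 0; i < count; i++) src[i] = static_cast<float>(i);

    cudaPitchedPtr dev;
    if (cudaMalloc3D(&dev, make_cudaExtent(n * sizeof(float), n, n)) != cudaSuccess)
        return false;

    bool ok =
        copyVolume(src.data(), dev, n, n, n, cudaMemcpyHostToDevice) == cudaSuccess &&
        copyVolume(dst.data(), dev, n, n, n, cudaMemcpyDeviceToHost) == cudaSuccess;
    cudaFree(dev.ptr);
    return ok && src == dst;
}
```

Because cudaMemcpy3D is told the pitch on both sides, the padded rows on the device are skipped during the copy and the host buffer stays tightly packed.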
