cuda - Issuing a cudaMemcpy() call to a stream

Question

Consider these two code snippets.

Snippet1

cudaStream_t stream1, stream2 ;
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpyAsync( dst, src, size, dir, stream1 );
kernel<<<grid, block, 0, stream2>>>(...);



Snippet2
cudaStreamCreate(&stream1);
cudaStreamCreate(&stream2);
cudaMemcpy( dst, src, size, dir, stream1 );
kernel<<<grid, block, 0, stream2>>>(...);

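In both snippets I issue a memcpy call (asynchronous in Snippet1, synchronous in Snippet2).

Since the commands are issued to two different streams, my understanding is that the copy and the kernel could overlap in both cases.

But in Snippet2 the cudaMemcpy call is synchronous (i.e. blocking), which leads me to the contradictory conclusion that the cudaMemcpy and the kernel call will execute one after the other.

Which conclusion is correct?

More concisely: when we issue a cudaMemcpy call to a stream, does it block the "whole code", or only the stream it was issued to?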

score 3 · Accepted Answer

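A synchronous call does not return control to the CPU until the operation has completed, so your second snippet will not even begin submitting the kernel launch until the memcpy is done.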

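Your cudaMemcpy() call looks incorrect; I don't think you can specify a stream parameter for any memcpy variant whose name does not end in "Async". cudaMemcpy() takes only four arguments (dst, src, count, kind), so as written, with a fifth argument, the call should be rejected by the compiler rather than be issued to stream1.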
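A minimal sketch of the Snippet1 pattern, written out so that it compiles. The kernel scale, the buffer names, and the sizes are placeholders chosen for illustration, not taken from the question. It assumes the host buffer is pinned (cudaMallocHost) and that the kernel works on a buffer the copy does not touch, which are the conditions under which the copy on stream1 and the kernel on stream2 are allowed to overlap. Whether they actually overlap depends on the device.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel for illustration: doubles each element in place.
__global__ void scale(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);

    float *src = nullptr, *dst = nullptr, *work = nullptr;
    CHECK(cudaMallocHost(&src, size));  // pinned host memory
    CHECK(cudaMalloc(&dst, size));      // destination of the copy
    CHECK(cudaMalloc(&work, size));     // separate buffer used by the kernel
    CHECK(cudaMemset(work, 0, size));
    for (int i = 0; i < n; ++i) src[i] = 1.0f;

    cudaStream_t stream1, stream2;
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));

    // Queued on stream1; control returns to the CPU without waiting for the copy.
    CHECK(cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream1));
    // Queued on stream2; independent of the copy, so the two may overlap.
    scale<<<(n + 255) / 256, 256, 0, stream2>>>(work, n);
    CHECK(cudaGetLastError());

    // Wait for both streams before releasing resources.
    CHECK(cudaDeviceSynchronize());
    printf("copy and kernel completed\n");

    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaFree(work));
    CHECK(cudaFree(dst));
    CHECK(cudaFreeHost(src));
    return 0;
}
```

Replacing the cudaMemcpyAsync line with a four-argument cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice) gives the blocking behaviour described above: the kernel launch is not submitted until that call has returned.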

score 1 · Accepted Answer

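ArcheaSoftware is partially correct. Indeed, a synchronous call does not return control to the CPU until the operation has completed. In that sense, your kernel launch will only happen after the cudaMemcpy call returns. However, depending on the type of your host buffer, the kernel may or may not be able to use the data transferred by the cudaMemcpy call. Some examples follow.

Example 1: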

cudaMallocHost(&src, size);
cudaMalloc(&dst, size);
cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);
kernel<<<grid, block, 0, stream2>>>(...);

In this case, the kernel can use the data copied from src to dst.

Example 2:

src = malloc(size);
cudaMalloc(&dst, size);
cudaMemcpy(dst, src, size, cudaMemcpyHostToDevice);
kernel<<<grid, block, 0, stream2>>>(...);

In this case, cudaMemcpy can return before the data is actually transferred to the device.

cudaMemcpy from unregistered host buffers (e.g., malloc buffers) only guarantees that the data is copied out of the source buffer, perhaps into an intermediate staging buffer, before the call returns. This is surprising behavior, but is defined as such in the NVIDIA CUDA documents. Ref: https://docs.nvidia.com/cuda/cuda-runtime-api/api-sync-behavior.html#api-sync-behavior

In general, I recommend avoiding unregistered host buffers because of this behavior.
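One way to follow that recommendation while keeping a malloc buffer is to register it with cudaHostRegister and make the copy-then-kernel ordering explicit with an event. The sketch below is only an illustration under those assumptions: the kernel addOne, the event name copied, and the sizes are hypothetical and not part of the question or the answer.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Abort with a message if a CUDA runtime call fails.
#define CHECK(call)                                                   \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "%s failed: %s\n", #call,                 \
                    cudaGetErrorString(err_));                        \
            return 1;                                                 \
        }                                                             \
    } while (0)

// Hypothetical kernel for illustration: adds 1 to each element.
__global__ void addOne(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1.0f;
}

int main() {
    const int n = 1 << 20;
    const size_t size = n * sizeof(float);

    // Ordinary malloc buffer, then page-locked (registered) with the runtime.
    float *src = static_cast<float *>(malloc(size));
    if (src == nullptr) return 1;
    for (int i = 0; i < n; ++i) src[i] = 1.0f;
    CHECK(cudaHostRegister(src, size, cudaHostRegisterDefault));

    float *dst = nullptr;
    CHECK(cudaMalloc(&dst, size));

    cudaStream_t stream1, stream2;
    CHECK(cudaStreamCreate(&stream1));
    CHECK(cudaStreamCreate(&stream2));
    cudaEvent_t copied;
    CHECK(cudaEventCreate(&copied));

    // Copy on stream1, then record an event that marks its completion.
    CHECK(cudaMemcpyAsync(dst, src, size, cudaMemcpyHostToDevice, stream1));
    CHECK(cudaEventRecord(copied, stream1));

    // stream2 waits for the copy before running the kernel that reads dst.
    CHECK(cudaStreamWaitEvent(stream2, copied, 0));
    addOne<<<(n + 255) / 256, 256, 0, stream2>>>(dst, n);
    CHECK(cudaGetLastError());

    // Copy the result back on stream2 and wait for that stream to drain.
    CHECK(cudaMemcpyAsync(src, dst, size, cudaMemcpyDeviceToHost, stream2));
    CHECK(cudaStreamSynchronize(stream2));

    // Every element started at 1.0f and was incremented once.
    bool ok = (src[0] == 2.0f) && (src[n - 1] == 2.0f);
    printf("%s\n", ok ? "kernel saw the copied data" : "unexpected result");

    CHECK(cudaEventDestroy(copied));
    CHECK(cudaStreamDestroy(stream1));
    CHECK(cudaStreamDestroy(stream2));
    CHECK(cudaFree(dst));
    CHECK(cudaHostUnregister(src));
    free(src);
    return ok ? 0 : 1;
}
```

The event makes the dependency between the two streams explicit, so the kernel's view of dst does not rely on how a particular cudaMemcpy variant happens to synchronize.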
