
My project will have multiple threads, each of which issues kernel launches on a different cudaStream. Some other threads will consume the results, which will be stored in a queue. Here is some pseudocode:

while (true) {
  cudaMemcpyAsync(d_mem, h_mem, nbytes, cudaMemcpyHostToDevice, some_stream);
  kernel_launch(some_stream);   // stand-in for a real kernel<<<...>>> launch
  cudaMemcpyAsync(h_queue_results[i++], d_result, nbytes, cudaMemcpyDeviceToHost, some_stream);
}

Is it safe to reuse h_mem after the first cudaMemcpyAsync returns? Or should I use N host buffers to issue the GPU computations?

How can I know when it is safe to reuse h_mem? Should I do some synchronization with cudaEvents?

By the way, h_mem is host-pinned. If it were pageable, could I reuse h_mem right away? From what I have read here, it seems I could reuse it as soon as memcpyAsync returns, is that right?

Asynchronous

For transfers from pageable host memory to device memory, host memory is copied to a staging buffer immediately (no device synchronization is performed). The function will return once the pageable buffer has been copied to the staging memory. The DMA transfer to the final destination may not have completed. For transfers between pinned host memory and device memory, the function is fully asynchronous. For transfers from device memory to pageable host memory, the function will return only once the copy has completed. For all other transfers, the function is fully asynchronous. If pageable memory must first be staged to pinned memory, this will be handled asynchronously with a worker thread. For transfers from any host memory to any host memory, the function is fully synchronous with respect to the host.

MemcpyAsynchronousBehavior

Thanks!


1 Answer


In order to get copy/compute overlap, you must use pinned memory. The reason for this is contained in the paragraph you excerpted. Presumably the whole reason for your multi-streamed approach is for copy/compute overlap, so I don't think the correct answer is to switch to using pageable memory buffers.
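As a side note, pinned buffers are normally obtained with cudaMallocHost() (or cudaHostAlloc()) rather than malloc(). A minimal sketch, with an assumed element count N, looks like this:

float *h_mem = NULL;
size_t nbytes = N * sizeof(float);      // N is illustrative

// Pinned (page-locked) allocation: needed for truly asynchronous
// host<->device copies, and therefore for copy/compute overlap.
cudaMallocHost((void **)&h_mem, nbytes);

// ... issue cudaMemcpyAsync / kernel launches that read h_mem ...

cudaFreeHost(h_mem);                    // release the pinned allocation when done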

Regarding your question, assuming h_mem is only used as the source buffer for the pseudo-code you've shown here (i.e. the data in it only participates in that one cudaMemcpyAsync call), then the h_mem buffer is no longer needed once the next CUDA operation in that stream begins. So if your kernel_launch were an actual kernel<<<...>>>(...), then once the kernel begins, you can be assured that the previous cudaMemcpyAsync is complete.
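If you want the loop itself to stay fully asynchronous, one common pattern (just a sketch; names such as NBUF, fill_input, my_kernel, grid and block are assumptions, not your code) is to rotate through a small pool of pinned host buffers and only wait on a buffer when you are about to overwrite it:

#define NBUF 4                                  // assumed pool size
float *h_buf[NBUF];                             // pinned staging buffers
cudaEvent_t h2d_done[NBUF];                     // one event per buffer

for (int b = 0; b < NBUF; b++) {
  cudaMallocHost((void **)&h_buf[b], nbytes);
  cudaEventCreate(&h2d_done[b]);
}

int b = 0;
while (true) {
  // Block only if this buffer's previous H2D copy has not finished yet
  // (synchronizing on a never-recorded event returns immediately).
  cudaEventSynchronize(h2d_done[b]);

  fill_input(h_buf[b]);                         // now safe to overwrite

  cudaMemcpyAsync(d_mem, h_buf[b], nbytes, cudaMemcpyHostToDevice, some_stream);
  cudaEventRecord(h2d_done[b], some_stream);    // marks when h_buf[b] is reusable

  my_kernel<<<grid, block, 0, some_stream>>>(d_mem, d_result);
  cudaMemcpyAsync(h_queue_results[i++], d_result, nbytes,
                  cudaMemcpyDeviceToHost, some_stream);

  b = (b + 1) % NBUF;
}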

You could use cudaEvents with cudaEventSynchronize() or cudaStreamWaitEvent(), or you could use cudaStreamSynchronize() directly in the stream. For example, if you have a cudaStreamSynchronize() call somewhere in the stream pseudocode you have shown, and it is after the cudaMemcpyAsync call, then any code after the cudaStreamSynchronize() call is guaranteed to be executing after the cudaMemcpyAsync() call is complete. All of the calls I've referenced are documented in the usual place.
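For the single-buffer case in your pseudocode, a minimal sketch of the event approach (the event name is mine) might look like:

cudaEvent_t copy_done;
cudaEventCreate(&copy_done);

cudaMemcpyAsync(d_mem, h_mem, nbytes, cudaMemcpyHostToDevice, some_stream);
cudaEventRecord(copy_done, some_stream);   // recorded right after the H2D copy

kernel_launch(some_stream);                // your kernel, launched into the same stream
cudaMemcpyAsync(h_queue_results[i++], d_result, nbytes,
                cudaMemcpyDeviceToHost, some_stream);

// The host blocks here until the H2D copy has completed; after this
// returns it is safe to overwrite h_mem with the next batch of input.
cudaEventSynchronize(copy_done);

// Alternatively, cudaStreamSynchronize(some_stream) waits for *all* work
// issued to the stream so far, including the kernel and the D2H copy.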

Answered 2013-04-17T14:03:15.730