concurrency - How to reduce CUDA synchronize latency / delay

Question

This question is related to using CUDA streams to run many kernels.

In CUDA there are many synchronization commands: cudaStreamSynchronize, cudaDeviceSynchronize, cudaThreadSynchronize, and also cudaStreamQuery to check whether a stream is empty.

I noticed when using the profiler that these synchronize commands introduce a large delay to the program. I was wondering if anyone knows any means to reduce this latency, apart from, of course, using as few synchronisation commands as possible.

Also, are there any figures for judging the most efficient synchronisation method? For example, consider an application that uses 3 streams, where two of them need to complete before I can launch a fourth stream. Should I use two cudaStreamSynchronize calls or just one cudaDeviceSynchronize? Which will incur less loss?

score 11 · Accepted Answer

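The main difference between the synchronize methods is "polling" versus "blocking."

"Polling" is the default mechanism the driver uses to wait for the GPU: it waits for a 32-bit memory location to reach a certain value written by the GPU. It may return more quickly once the wait is resolved, but while waiting it burns a CPU core watching that memory location.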

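"Blocking" can be requested by calling cudaSetDeviceFlags() with cudaDeviceScheduleBlockingSync, or by calling cudaEventCreate() with cudaEventBlockingSync. Blocking waits cause the driver to insert a command into the DMA command buffer that signals an interrupt once all preceding commands in the buffer have executed. The driver can then map the interrupt to a Windows event or a Linux file handle, so the synchronization commands can wait without constantly burning CPU the way the default polling methods do.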
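As a rough illustration of the two ways to request blocking waits, here is a minimal sketch. The kernel `busyKernel` and the helper `runWithBlockingSync` are hypothetical names invented for this example, and the sketch uses `cudaEventCreateWithFlags`, the runtime call that accepts event flags such as `cudaEventBlockingSync`. Whether blocking actually helps depends on the workload, so treat this as a starting point for profiling rather than a recommendation.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel, only here so there is GPU work to wait on.
__global__ void busyKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Hypothetical helper: launches work, then waits on it with a blocking sync.
// Returns cudaSuccess, or the first error encountered.
cudaError_t runWithBlockingSync(int n) {
    // Device-wide option: ask the driver to block the CPU thread instead of
    // spinning. Call this early; older toolkits reject it once the device
    // has already been initialized.
    cudaError_t err = cudaSetDeviceFlags(cudaDeviceScheduleBlockingSync);
    if (err != cudaSuccess) return err;

    float *d = nullptr;
    err = cudaMalloc(&d, n * sizeof(float));
    if (err != cudaSuccess) return err;

    // Per-event option: synchronizing on this event blocks rather than polls.
    cudaEvent_t done;
    err = cudaEventCreateWithFlags(&done,
                                   cudaEventBlockingSync | cudaEventDisableTiming);
    if (err != cudaSuccess) { cudaFree(d); return err; }

    err = cudaMemset(d, 0, n * sizeof(float));
    if (err == cudaSuccess) {
        busyKernel<<<(n + 255) / 256, 256>>>(d, n);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    if (err == cudaSuccess) err = cudaEventRecord(done, 0);
    if (err == cudaSuccess) err = cudaEventSynchronize(done);  // CPU thread sleeps here

    cudaEventDestroy(done);
    cudaFree(d);
    return err;
}
```

Calling `runWithBlockingSync(1 << 20)` from `main` and comparing CPU utilization against a build without the two flags is one way to see the trade-off between wake-up latency and CPU usage.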

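The queries are basically a manual check of the 32-bit memory location used for polling waits, so in most situations they are very cheap. But if ECC is enabled, the query will dive into kernel mode to check for ECC errors; and on Windows, any pending commands will be flushed to the driver (which requires a kernel thunk).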
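A query-based wait might look like the sketch below, which polls a stream with `cudaStreamQuery` and leaves room for host work between checks. The names `scaleKernel`, `pollUntilDone`, and `launchAndPoll` are hypothetical and exist only for this example; the poll counter is there purely so the behavior can be observed.

```cuda
#include <cuda_runtime.h>

// Hypothetical placeholder kernel, only here so the stream has work queued.
__global__ void scaleKernel(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

// Hypothetical helper: polls the stream until its work has finished.
// cudaStreamQuery returns cudaErrorNotReady while work is still pending.
cudaError_t pollUntilDone(cudaStream_t stream, long *polls) {
    *polls = 0;
    cudaError_t status;
    while ((status = cudaStreamQuery(stream)) == cudaErrorNotReady) {
        ++*polls;
        // Useful host-side work could go here instead of a bare spin.
    }
    return status;  // cudaSuccess when the stream is idle, otherwise an error
}

// Hypothetical driver: queues a kernel on its own stream and polls for it.
cudaError_t launchAndPoll(int n, long *polls) {
    cudaStream_t stream;
    cudaError_t err = cudaStreamCreate(&stream);
    if (err != cudaSuccess) return err;

    float *d = nullptr;
    err = cudaMalloc(&d, n * sizeof(float));
    if (err != cudaSuccess) { cudaStreamDestroy(stream); return err; }

    err = cudaMemsetAsync(d, 0, n * sizeof(float), stream);
    if (err == cudaSuccess) {
        scaleKernel<<<(n + 255) / 256, 256, 0, stream>>>(d, n);
        err = cudaGetLastError();  // catches launch-configuration errors
    }
    if (err == cudaSuccess) err = pollUntilDone(stream, polls);

    cudaFree(d);
    cudaStreamDestroy(stream);
    return err;
}
```

Because each query can still enter kernel mode when ECC is enabled, or flush pending commands on Windows, a tight loop like this is not guaranteed to be cheap on every configuration.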
