concurrency - Priority of concurrent CUDA kernel execution

Question

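I have two kernels (A and B) that could execute concurrently. I need kernel A to finish as soon as possible (its results are exchanged over MPI). So I could run them in a single stream: A, then B.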

However, kernel A has very few thread blocks, so if I run A and B sequentially, the GPU is underutilized while A is running.

Is it possible to run A and B concurrently while giving A higher priority?

That is, I want thread blocks of kernel B to start executing only once kernel A has no blocks left to launch.

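As I understand it, if I launch kernel A in one stream and then, on the next line of host code, launch kernel B in another stream, there is no guarantee that thread blocks from B won't actually execute first. Is that right?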

score 3 · Accepted Answer

NVIDIA now provides a way to prioritize CUDA kernels. This is a fairly new feature, so you need to upgrade to CUDA 5.5 to use it.

In your case, you would launch kernel A in a high-priority CUDA stream and kernel B in a low-priority CUDA stream. The function you probably want is cudaStreamCreateWithPriority(..., priority).
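A minimal sketch of that setup, assuming the CUDA 5.5+ runtime and a device that supports stream priorities. The kernels `kernelA` and `kernelB`, the grid sizes, the `CHECK` macro and the `pickPriorities` helper are hypothetical placeholders, not part of the question or of the CUDA API. Kernel A is launched first into the high-priority stream; kernel B goes into the low-priority stream so its blocks can fill the SMs that A leaves idle.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check helper: abort on any CUDA runtime error.
#define CHECK(call)                                                  \
    do {                                                             \
        cudaError_t err_ = (call);                                   \
        if (err_ != cudaSuccess) {                                   \
            std::fprintf(stderr, "%s failed: %s\n", #call,           \
                         cudaGetErrorString(err_));                  \
            std::exit(EXIT_FAILURE);                                 \
        }                                                            \
    } while (0)

// Hypothetical helper: the "high" priority is the greatest priority, which is
// the numerically smaller end of the range (lower number = higher priority).
struct PriorityPair { int high; int low; };
PriorityPair pickPriorities(int leastPriority, int greatestPriority) {
    return PriorityPair{greatestPriority, leastPriority};
}

// Hypothetical stand-ins for the question's kernel A (few blocks) and B (many blocks).
__global__ void kernelA(float *out) { out[blockIdx.x * blockDim.x + threadIdx.x] = 1.0f; }
__global__ void kernelB(float *out) { out[blockIdx.x * blockDim.x + threadIdx.x] = 2.0f; }

int main() {
    int leastPriority = 0, greatestPriority = 0;
    CHECK(cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority));
    PriorityPair prio = pickPriorities(leastPriority, greatestPriority);

    cudaStream_t streamA, streamB;
    CHECK(cudaStreamCreateWithPriority(&streamA, cudaStreamNonBlocking, prio.high));
    CHECK(cudaStreamCreateWithPriority(&streamB, cudaStreamNonBlocking, prio.low));

    const int threads = 256, blocksA = 4, blocksB = 4096;  // illustrative sizes only
    float *dA = nullptr, *dB = nullptr;
    CHECK(cudaMalloc(&dA, blocksA * threads * sizeof(float)));
    CHECK(cudaMalloc(&dB, blocksB * threads * sizeof(float)));

    // Launch A first in the high-priority stream, then B in the low-priority one.
    kernelA<<<blocksA, threads, 0, streamA>>>(dA);
    CHECK(cudaGetLastError());
    kernelB<<<blocksB, threads, 0, streamB>>>(dB);
    CHECK(cudaGetLastError());

    // Wait only for A; its results can be exchanged (e.g. over MPI) while B keeps running.
    CHECK(cudaStreamSynchronize(streamA));
    // ... MPI exchange of A's results would go here ...

    CHECK(cudaStreamSynchronize(streamB));
    std::printf("priorities used: A=%d, B=%d\n", prio.high, prio.low);

    CHECK(cudaFree(dA));
    CHECK(cudaFree(dB));
    CHECK(cudaStreamDestroy(streamA));
    CHECK(cudaStreamDestroy(streamB));
    return 0;
}
```

As I understand it, the priority influences which pending thread blocks the scheduler dispatches next; blocks of B that are already resident are not necessarily evicted, which is why A is launched first here. Treat it as a strong scheduling preference rather than a hard guarantee.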

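To use this feature, you need a GPU with Compute Capability 3.5 or higher. To check whether your GPU supports priorities, look at cudaDeviceProp::streamPrioritiesSupported.
cudaDeviceGetStreamPriorityRange should tell you how many priority levels are available on your GPU. Its calling convention is a bit unusual: it returns the range through two output pointers, the least (lowest) priority first and the greatest (highest) priority second, and because lower numbers mean higher priority, the greatest priority is the numerically smaller value. It is worth checking the CUDA manual to see exactly how it works.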
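A sketch of both checks, assuming device 0 and the CUDA 5.5+ runtime (the `#error` guard relies on `CUDART_VERSION`, which the runtime headers define). `priorityLevelCount` is a hypothetical helper, not a CUDA API; it just counts the integer values between the two ends of the range. On a device without priority support I would expect both values to come back as 0, i.e. a single level.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

#if !defined(CUDART_VERSION) || CUDART_VERSION < 5050
#error "Stream priorities need the CUDA 5.5 runtime or newer"
#endif

// Hypothetical helper: number of distinct priority values in [greatest, least].
int priorityLevelCount(int leastPriority, int greatestPriority) {
    return leastPriority - greatestPriority + 1;
}

int main() {
    const int device = 0;  // assumption: query the first GPU
    cudaDeviceProp prop;
    if (cudaSetDevice(device) != cudaSuccess ||
        cudaGetDeviceProperties(&prop, device) != cudaSuccess) {
        std::fprintf(stderr, "no usable CUDA device\n");
        return 1;
    }
    std::printf("%s: compute capability %d.%d, streamPrioritiesSupported = %d\n",
                prop.name, prop.major, prop.minor, prop.streamPrioritiesSupported);

    int leastPriority = 0, greatestPriority = 0;
    // Argument order: least (lowest) priority first, greatest (highest) second.
    if (cudaDeviceGetStreamPriorityRange(&leastPriority, &greatestPriority) != cudaSuccess) {
        std::fprintf(stderr, "could not query the priority range\n");
        return 1;
    }
    std::printf("priority range: least = %d, greatest = %d (%d levels)\n",
                leastPriority, greatestPriority,
                priorityLevelCount(leastPriority, greatestPriority));
    return 0;
}
```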

More detailed documentation on setting priorities, from the CUDA Runtime API manual:

cudaError_t cudaStreamCreateWithPriority(cudaStream_t *pStream, 
                                         unsigned int flags, int priority)
Create an asynchronous stream with the specified priority.

Parameters
pStream  = Pointer to new stream identifier 
flags    = Flags for stream creation. See cudaStreamCreateWithFlags for a list of 
           valid flags that can be passed 
priority = Priority of the stream. Lower numbers represent higher priorities. See  
           cudaDeviceGetStreamPriorityRange for more information about the 
           meaningful stream priorities that can be passed.
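To see the "lower numbers represent higher priorities" convention and what happens to a value outside the range, the sketch below requests a priority far outside the usual range and reads back what the stream actually got with `cudaStreamGetPriority`. As I recall from the runtime documentation, out-of-range priorities are clamped to the nearest end of the range; `expectedClamp` is a hypothetical helper that encodes that assumption, so compare its result with the value the runtime reports.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper encoding the assumed clamping rule: values are clamped
// into [greatestPriority, leastPriority], where greatest is numerically smaller.
int expectedClamp(int requested, int leastPriority, int greatestPriority) {
    if (requested < greatestPriority) return greatestPriority;
    if (requested > leastPriority) return leastPriority;
    return requested;
}

int main() {
    int least = 0, greatest = 0;
    if (cudaDeviceGetStreamPriorityRange(&least, &greatest) != cudaSuccess) {
        std::fprintf(stderr, "could not query the priority range\n");
        return 1;
    }

    const int requested = -100;  // deliberately outside the usual range
    cudaStream_t s;
    if (cudaStreamCreateWithPriority(&s, cudaStreamNonBlocking, requested) != cudaSuccess) {
        std::fprintf(stderr, "stream creation failed\n");
        return 1;
    }

    int actual = 0;
    cudaStreamGetPriority(s, &actual);  // priority the runtime actually assigned
    std::printf("requested %d, got %d, expected %d\n",
                requested, actual, expectedClamp(requested, least, greatest));
    cudaStreamDestroy(s);
    return 0;
}
```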
