I've got a few questions regarding CUDA's scheduling system.
A. When I call, for example, foo<<<255, 255>>>(), what actually happens inside the card? I know that each SM receives a block to schedule from the upper level, and that each SM is responsible for scheduling its incoming block, but which part does this? If, for example, I have 8 SMs, each of which contains 8 small CPUs, is the upper level responsible for scheduling the remaining 255*255 - (8 * 8) threads? A sketch of the kind of launch I mean is below.
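To make question A concrete, here is a minimal sketch of the launch I have in mind. The kernel name foo and the data array are just placeholders I made up for illustration:

    #include <cuda_runtime.h>

    // Hypothetical kernel "foo", used only to illustrate the launch below.
    __global__ void foo(int *data)
    {
        // 255 blocks of 255 threads each; this computes a unique global
        // index for every thread in the grid (255 * 255 = 65025 total).
        int idx = blockIdx.x * blockDim.x + threadIdx.x;
        data[idx] = idx;
    }

    int main()
    {
        int *d_data;
        cudaMalloc(&d_data, 255 * 255 * sizeof(int));

        // The <<<255, 255>>> configuration asks for 255 blocks of 255
        // threads; as far as I understand, the blocks are handed out to
        // the SMs by the hardware scheduler, not by host code.
        foo<<<255, 255>>>(d_data);
        cudaDeviceSynchronize();

        cudaFree(d_data);
        return 0;
    }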
B. What is the maximum number of threads one can define? I mean, in foo<<<X, Y>>>(); what are the limits on X and Y?
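For question B, I assume the actual limits are per-device, so something like the following (using cudaGetDeviceProperties) should report them; is that the right way to find X and Y?

    #include <cstdio>
    #include <cuda_runtime.h>

    int main()
    {
        // Query the limits of device 0 at runtime.
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);

        printf("Max threads per block (Y): %d\n", prop.maxThreadsPerBlock);
        printf("Max grid size (X): %d x %d x %d\n",
               prop.maxGridSize[0], prop.maxGridSize[1], prop.maxGridSize[2]);
        return 0;
    }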
C. Regarding the last example, how many threads can there be inside one block? Can we say that the more blocks/threads we have, the faster the execution will be?
Thanks for your help