cuda - Kepler CUDA dynamic parallelism and thread divergence

Question

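There is very little information available about Kepler's dynamic parallelism. Judging from the description of this new technology, does it mean that the problem of control-flow divergence among threads in the same warp has been solved?

It allows recursion and launching kernels from device code. Does that mean control paths in different threads can be executed simultaneously?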

score 3 · Accepted Answer

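Take a look at this paper.

Dynamic parallelism, flow divergence, and recursion are separate concepts. Dynamic parallelism is the ability to launch a kernel (a new grid of threads) from within a thread. This means, for example, that you can do this: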

__global__ void t_father(...)   {
   ...
   t_child<<< BLOCKS, THREADS>>>();
   ...
}

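I have personally investigated this. When you do something like this and t_father launches t_child, the GPU's resources are redistributed among the child threads, and t_father waits until all t_child threads have finished before it can continue (see also this paper, slide 25).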
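To make the fragment above concrete, here is a minimal sketch of a parent kernel launching a child kernel. The kernel bodies, the block size of 128, and the use of managed memory are assumptions made for illustration and are not taken from the answer. It assumes a device of compute capability 3.5 or newer and a build with relocatable device code, for example `nvcc -rdc=true example.cu -lcudadevrt` with a suitable `-arch` flag.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel (illustrative body): each thread doubles one element.
__global__ void t_child(int *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2;
}

// Parent kernel: a single thread launches the child grid from device code.
__global__ void t_father(int *data, int n)
{
    if (blockIdx.x == 0 && threadIdx.x == 0) {
        const int threads = 128;                         // hypothetical block size
        const int blocks = (n + threads - 1) / threads;  // enough blocks to cover n
        t_child<<<blocks, threads>>>(data, n);
    }
    // The parent grid is not considered complete until its child grid has finished.
}

int main()
{
    const int n = 1024;
    int *data = nullptr;
    cudaMallocManaged(&data, n * sizeof(int));
    for (int i = 0; i < n; ++i) data[i] = i;

    t_father<<<1, 32>>>(data, n);

    // The host waits here for the parent grid and, implicitly, its child grid.
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    printf("data[10] = %d\n", data[10]);  // 10 doubled once by the child: 20
    cudaFree(data);
    return 0;
}
```

Only one parent thread performs the launch here. If every parent thread launched a child grid, the array would be doubled once per launching thread.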

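Recursion has been available since Fermi. It is the ability of a function running in a thread to call itself, without any other thread/block reconfiguration.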
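For contrast with a device-side kernel launch, the following is a small sketch of plain device-side recursion. The `factorial` function and `fact_kernel` are hypothetical names chosen for illustration. No new grid is launched: each thread simply makes ordinary recursive calls on its own stack.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Recursive device function: runs entirely inside the calling thread.
__device__ unsigned int factorial(unsigned int n)
{
    return (n <= 1) ? 1u : n * factorial(n - 1);
}

// Each thread computes the factorial of its own thread index.
__global__ void fact_kernel(unsigned int *out)
{
    unsigned int i = threadIdx.x;
    out[i] = factorial(i);
}

int main()
{
    const int n = 8;
    unsigned int *out = nullptr;
    cudaMallocManaged(&out, n * sizeof(unsigned int));

    fact_kernel<<<1, n>>>(out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < n; ++i)
        printf("%d! = %u\n", i, out[i]);

    cudaFree(out);
    return 0;
}
```

This builds without `-rdc=true`, which is one practical way to see that recursion and dynamic parallelism are independent features.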

Regarding flow divergence, I guess we will never see threads within a warp executing different code at the same time.
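As an illustration of what divergence within a warp looks like, here is a minimal sketch in which even and odd lanes of one warp take different branches. The kernel name and the values written are assumptions for illustration. Under the SIMT model the two paths are generally executed one after the other with the inactive lanes masked off, although the compiler may turn a branch this small into predicated instructions.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Threads of the same warp take different branches depending on lane parity.
__global__ void divergent(int *out)
{
    int i = threadIdx.x;
    if (i % 2 == 0)
        out[i] = i * 10;   // path taken by even lanes
    else
        out[i] = -i;       // path taken by odd lanes
}

int main()
{
    const int n = 32;      // one full warp
    int *out = nullptr;
    cudaMallocManaged(&out, n * sizeof(int));

    divergent<<<1, n>>>(out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    for (int i = 0; i < 4; ++i)
        printf("out[%d] = %d\n", i, out[i]);

    cudaFree(out);
    return 0;
}
```

The results are correct either way. Divergence affects how long the warp takes, not what it computes.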

score 0 · Accepted Answer

There's a sample CUDA source in this NVIDIA presentation on slide 9.

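__global__ void convolution(int x[])
{
   // Each parent block launches x[blockIdx.x] child grids.
   // The launch configuration is elided in the slide and is left as-is here.
   for (int j = 1; j <= x[blockIdx.x]; j++)
      kernel<<< ... >>>(blockIdx.x, j);
}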

It goes on to show how part of the CUDA control code is moved to the GPU, so that the kernel can spawn other kernel functions on partial compute domains of various sizes (slide 14).

The global compute domain and the partitioning of it are still static, so you can't actually go and change this DURING GPU computation to e.g. spawn more kernel executions because you've not reached the end of your evaluation function yet. Instead, you provide an array that holds the number of threads you want to spawn with a specific kernel.
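The array-driven pattern described above could be sketched as follows. The names `parent`, `child`, `counts`, and `out`, the block size of 64, and the sample counts are all assumptions for illustration and do not come from the presentation. Each parent block reads its entry from `counts` and launches a child grid with that many working threads. Like any device-side launch, this needs compute capability 3.5 or newer and `nvcc -rdc=true ... -lcudadevrt`.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Child kernel: every working thread adds 1 to its parent block's counter.
__global__ void child(int *out, int parentBlock, int count)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < count) atomicAdd(&out[parentBlock], 1);
}

// Parent kernel: one thread per block sizes and launches that block's child grid.
__global__ void parent(const int *counts, int *out)
{
    if (threadIdx.x == 0) {
        int count = counts[blockIdx.x];
        if (count > 0) {
            const int threads = 64;                              // hypothetical block size
            const int blocks = (count + threads - 1) / threads;  // cover `count` threads
            child<<<blocks, threads>>>(out, blockIdx.x, count);
        }
    }
}

int main()
{
    const int numBlocks = 4;
    int *counts = nullptr, *out = nullptr;
    cudaMallocManaged(&counts, numBlocks * sizeof(int));
    cudaMallocManaged(&out, numBlocks * sizeof(int));

    const int sample[numBlocks] = {5, 100, 37, 0};   // illustrative per-block sizes
    for (int b = 0; b < numBlocks; ++b) { counts[b] = sample[b]; out[b] = 0; }

    parent<<<numBlocks, 32>>>(counts, out);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        printf("CUDA error: %s\n", cudaGetErrorString(err));
        return 1;
    }

    // Each out[b] should match counts[b] once all child grids have finished.
    for (int b = 0; b < numBlocks; ++b)
        printf("block %d: requested %d, ran %d\n", b, counts[b], out[b]);

    cudaFree(counts);
    cudaFree(out);
    return 0;
}
```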

score 0 · Accepted Answer

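No, the concept of a warp still exists. All threads in a warp are SIMD (single instruction, multiple data), which means they execute one instruction at the same time. Even when you call a child kernel, the GPU assigns one or more warps to your call. When using dynamic parallelism, keep 3 things in mind:

The maximum nesting depth is 24 (CC = 3.5).
The number of dynamic kernels running at the same time is limited (default 4096), but it can be increased.
Keep the parent kernel busy after the child kernel call; otherwise you are very likely to waste resources.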
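The limit on outstanding device-side launches mentioned above can be inspected and raised from the host. The sketch below uses `cudaDeviceGetLimit` and `cudaDeviceSetLimit` with `cudaLimitDevRuntimePendingLaunchCount`. The new value of 8192 is an arbitrary assumption for illustration, and the program prints whatever default the installed runtime reports rather than assuming one. The limit should be set before launching any kernel that uses device-side launches.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    size_t pending = 0;

    // Query the current size of the pending-launch buffer.
    cudaError_t err = cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    if (err != cudaSuccess) {
        printf("cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("pending launch count before: %zu\n", pending);

    // Raise the limit (8192 is an illustrative value, not a recommendation).
    err = cudaDeviceSetLimit(cudaLimitDevRuntimePendingLaunchCount, 8192);
    if (err != cudaSuccess) {
        printf("cudaDeviceSetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }

    err = cudaDeviceGetLimit(&pending, cudaLimitDevRuntimePendingLaunchCount);
    if (err != cudaSuccess) {
        printf("cudaDeviceGetLimit failed: %s\n", cudaGetErrorString(err));
        return 1;
    }
    printf("pending launch count after:  %zu\n", pending);
    return 0;
}
```

A larger buffer reserves more device memory, so it is worth raising only when a workload actually queues that many child launches.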
