cuda - About warp voting function

Question

The CUDA programming guide introduced the concept of warp vote function, "_all", "_any" and "__ballot".

My question is: what applications will use these 3 functions?

score 6 · Accepted Answer

的原型__ballot如下

unsigned int __ballot(int predicate);

如果predicate为非零，则__ballot返回一个设置了N第 th 位的值，其中N是线程索引。

atomicOr与和结合使用__popc，可用于累积每个经纱中具有真谓词的线程数。

的确，原型atomicOr是

int atomicOr(int* address, int val);

并atomicOr读取指向的值address，对执行按位OR运算val，然后将值写回address并返回其旧值作为返回参数。

另一方面，__popc返回使用32-bit 参数设置的位数。

因此，指令

volatile __shared__ u32 warp_shared_ballot[MAX_WARPS_PER_BLOCK];

const u32 warp_sum = threadIdx.x >> 5;

atomicOr(&warp_shared_ballot[warp_num],__ballot(data[tid]>threshold));

atomicAdd(&block_shared_accumulate,__popc(warp_shared_ballot[warp_num]));

可用于计算谓词为真的线程数。

有关更多详细信息，请参阅 Shane Cook、CUDA 编程、Morgan Kaufmann

score 5 · Accepted Answer

__ballot在CUDA 直方图和 CUDA NPP 库中用于快速生成位掩码，并将其与__popc内在函数相结合以非常有效地实现布尔缩减。

__all并__any在引入之前用于归约__ballot，尽管我想不出它们的任何其他用途。

score 1 · Accepted Answer

作为使用 __ballot API 的算法示例，我会提到 DM Hughes 等人的 In-Kernel Stream Compaction。它用于流压缩的前缀和部分，以计算（每个扭曲）通过谓词的元素数量。

这里是纸。In-k 流压缩

score 0 · Accepted Answer

CUDA 提供了 NVIDIA 架构有效支持的几个 warp-wide 广播和缩减操作。例如， __ballot(predicate) 指令为 warp 的所有活动线程计算 predicate 并返回一个整数，当且仅当 predicate 对warp 的第 N 个线程和第 N个线程计算为非零时处于活动状态 [参考：GPU 架构的灵活软件分析]。

cuda - About warp voting function

4 回答 4

Related

Reference