c++ - CUB (CUDA UnBound) equivalent of thrust::gather

Question

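Because of some performance issues with the Thrust library (see this page for more details), I plan to refactor a CUDA application to use CUB instead of Thrust, specifically to replace the thrust::sort_by_key and thrust::inclusive_scan calls. At one particular point in my application I need to sort 3 arrays by key. This is how I did it with Thrust: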

thrust::sort_by_key(key_iter, key_iter + numKeys, indices);
thrust::gather_wrapper(indices, indices + numKeys, 
      thrust::make_zip_iterator(thrust::make_tuple(values1Ptr, values2Ptr, values3Ptr)),
      thrust::make_zip_iterator(thrust::make_tuple(valuesOut1Ptr, valuesOut2Ptr, valuesOut3Ptr))
);

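where

key_iter is a thrust::device_ptr that points to the keys I want to sort by
indices points to a sequence (from 0 to numKeys-1) in device memory
values{1,2,3}Ptr are device_ptrs to the values I want to sort
valuesOut{1,2,3}Ptr are device_ptrs to the sorted values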

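With the CUB SortPairs function I can sort a single value buffer, but not all 3 in one shot. The problem is that I don't see any "gather-like" utility in CUB. Suggestions?

Edit:

I suppose I could implement my own gather kernel, but is there a better way to do this than: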

template <typename Index, typename Value> 
__global__ void  gather_kernel(const unsigned int N, const Index * map, 
const Value * src, Value * dst) 
{ 
    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x; 
    if (i < N) 
    { 
        dst[i] = src[map[i]]; 
    } 
}

The uncoalesced loads and stores bother me, but they are probably unavoidable without a known structure on map.

score 4 · Accepted Answer

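It seems that what you want to achieve depends on thrust::zip_iterator. You could either

replace only thrust::sort_by_key with cub::DeviceRadixSort::SortPairs and keep thrust::gather, or
zip values{1,2,3} into an array of structures before calling cub::DeviceRadixSort::SortPairs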
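
The first option can be sketched roughly as follows. This is only a minimal illustration, not the asker's actual code: the input data, the float value type, and the use of thrust::device_vector in place of raw device_ptrs are assumptions made to keep the example self-contained. The keys are sorted together with an index sequence by cub::DeviceRadixSort::SortPairs, and the sorted indices then drive a single thrust::gather over a zip of the three value arrays.

```cuda
#include <cstdio>
#include <cub/cub.cuh>
#include <thrust/device_vector.h>
#include <thrust/gather.h>
#include <thrust/host_vector.h>
#include <thrust/iterator/zip_iterator.h>
#include <thrust/sequence.h>
#include <thrust/tuple.h>

int main() {
    // Illustrative input (assumption): 8 unsigned keys in arbitrary order.
    const int numKeys = 8;
    const unsigned int hKeys[numKeys] = {5, 2, 7, 1, 6, 0, 4, 3};

    thrust::device_vector<unsigned int> keysIn(hKeys, hKeys + numKeys), keysOut(numKeys);
    thrust::device_vector<int> indicesIn(numKeys), indicesOut(numKeys);
    thrust::sequence(indicesIn.begin(), indicesIn.end());  // 0, 1, ..., numKeys-1

    // Three value arrays with easily recognizable contents (assumption).
    thrust::device_vector<float> values1(numKeys), values2(numKeys), values3(numKeys);
    thrust::sequence(values1.begin(), values1.end(), 100.0f);
    thrust::sequence(values2.begin(), values2.end(), 200.0f);
    thrust::sequence(values3.begin(), values3.end(), 300.0f);
    thrust::device_vector<float> valuesOut1(numKeys), valuesOut2(numKeys), valuesOut3(numKeys);

    unsigned int* dKeysIn  = thrust::raw_pointer_cast(keysIn.data());
    unsigned int* dKeysOut = thrust::raw_pointer_cast(keysOut.data());
    int* dIdxIn  = thrust::raw_pointer_cast(indicesIn.data());
    int* dIdxOut = thrust::raw_pointer_cast(indicesOut.data());

    // CUB's two-phase pattern: the first call only reports the temporary
    // storage size, the second call performs the sort.
    void* dTemp = nullptr;
    size_t tempBytes = 0;
    cub::DeviceRadixSort::SortPairs(dTemp, tempBytes, dKeysIn, dKeysOut,
                                    dIdxIn, dIdxOut, numKeys);
    cudaMalloc(&dTemp, tempBytes);
    cub::DeviceRadixSort::SortPairs(dTemp, tempBytes, dKeysIn, dKeysOut,
                                    dIdxIn, dIdxOut, numKeys);

    // One gather moves all three value arrays through the sorted index map.
    thrust::gather(
        indicesOut.begin(), indicesOut.end(),
        thrust::make_zip_iterator(thrust::make_tuple(
            values1.begin(), values2.begin(), values3.begin())),
        thrust::make_zip_iterator(thrust::make_tuple(
            valuesOut1.begin(), valuesOut2.begin(), valuesOut3.begin())));

    // Copy back and print the reordered values next to their sorted keys.
    thrust::host_vector<unsigned int> hSorted = keysOut;
    thrust::host_vector<float> h1 = valuesOut1, h2 = valuesOut2, h3 = valuesOut3;
    for (int i = 0; i < numKeys; ++i) {
        std::printf("key %u: %.0f %.0f %.0f\n", hSorted[i], h1[i], h2[i], h3[i]);
    }

    cudaFree(dTemp);
    return 0;
}
```

Error checking on the CUDA and CUB calls is omitted for brevity. The second option would instead pack the three values into one struct per element and pass that struct array as the value buffer to SortPairs, at the cost of an extra packing and unpacking pass.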

Update

After reading the implementation of thrust::gather,

$CUDA_HOME/include/thrust/system/detail/generic/gather.inl

you can see that it is essentially just a naive kernel like

__global__ void gather(const int* index, const float* in, float* out, int len) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;  // global thread index
  if (i < len) { out[i] = in[index[i]]; }
}

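So I think your code above can be replaced by a single kernel without too much effort.

In this kernel, you could first use the CUB block-wise primitive cub::BlockRadixSort<...>::SortBlockedToStriped to get the sorted indices stored in registers, and then perform a naive reordering copy, as thrust::gather does, to fill valuesOut{1,2,3}.

Using SortBlockedToStriped rather than Sort allows coalesced writes (though not coalesced reads) when copying the values.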
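
A rough sketch of such a kernel is shown below, under several assumptions that are not in the original posts: the keys are unsigned int, the values are float, the whole input fits in a single tile of BLOCK_THREADS * ITEMS_PER_THREAD elements handled by one block, and all real keys are smaller than the padding value 0xFFFFFFFF. The kernel name sort_and_gather3 and the tile constants are hypothetical. Larger inputs would still need a device-wide sort.

```cuda
#include <cstdio>
#include <cub/cub.cuh>

// Hypothetical tile shape: one block sorts at most 128 * 4 = 512 keys.
constexpr int BLOCK_THREADS = 128;
constexpr int ITEMS_PER_THREAD = 4;

__global__ void sort_and_gather3(const unsigned int* keys, int n,
                                 const float* v1, const float* v2, const float* v3,
                                 float* out1, float* out2, float* out3) {
    using BlockSort =
        cub::BlockRadixSort<unsigned int, BLOCK_THREADS, ITEMS_PER_THREAD, int>;
    __shared__ typename BlockSort::TempStorage temp;

    unsigned int k[ITEMS_PER_THREAD];
    int idx[ITEMS_PER_THREAD];

    // Blocked arrangement: thread t owns consecutive items starting at
    // t * ITEMS_PER_THREAD. Slots past n are padded with the maximum key
    // so that they sort to the end (assumes real keys are smaller).
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int pos = threadIdx.x * ITEMS_PER_THREAD + i;
        idx[i] = pos;
        k[i] = (pos < n) ? keys[pos] : 0xFFFFFFFFu;
    }

    // Sort keys and carry the original positions along as values.
    BlockSort(temp).SortBlockedToStriped(k, idx);

    // Striped arrangement: item i of thread t now has rank
    // i * BLOCK_THREADS + t, so neighbouring threads write neighbouring
    // output elements (coalesced), while the reads through idx are scattered.
    for (int i = 0; i < ITEMS_PER_THREAD; ++i) {
        int rank = i * BLOCK_THREADS + threadIdx.x;
        if (rank < n) {
            int src = idx[i];
            out1[rank] = v1[src];
            out2[rank] = v2[src];
            out3[rank] = v3[src];
        }
    }
}

int main() {
    const int n = 10;  // must not exceed BLOCK_THREADS * ITEMS_PER_THREAD
    unsigned int* keys;
    float *v1, *v2, *v3, *o1, *o2, *o3;
    cudaMallocManaged(&keys, n * sizeof(unsigned int));
    cudaMallocManaged(&v1, n * sizeof(float));
    cudaMallocManaged(&v2, n * sizeof(float));
    cudaMallocManaged(&v3, n * sizeof(float));
    cudaMallocManaged(&o1, n * sizeof(float));
    cudaMallocManaged(&o2, n * sizeof(float));
    cudaMallocManaged(&o3, n * sizeof(float));

    // Illustrative data: keys are a permutation of 0..n-1.
    for (int i = 0; i < n; ++i) {
        keys[i] = (7 * i + 3) % n;
        v1[i] = 100.0f + i;
        v2[i] = 200.0f + i;
        v3[i] = 300.0f + i;
    }

    sort_and_gather3<<<1, BLOCK_THREADS>>>(keys, n, v1, v2, v3, o1, o2, o3);
    cudaDeviceSynchronize();

    for (int i = 0; i < n; ++i) {
        std::printf("rank %d: %.0f %.0f %.0f\n", i, o1[i], o2[i], o3[i]);
    }

    cudaFree(keys); cudaFree(v1); cudaFree(v2); cudaFree(v3);
    cudaFree(o1); cudaFree(o2); cudaFree(o3);
    return 0;
}
```

Managed memory is used here only to keep the host code short; the kernel itself works the same way with ordinary device allocations.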
