cuda - Sum reduction with CUB

Question

According to this article, sum reduction with the CUB library should be one of the fastest ways to perform a parallel reduction. As you can see in the code fragment below, the execution time is measured excluding the first call to cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes, in, out, N, cub::Sum());. I assume that this call has something to do with memory preparation, and that when we reduce the same data several times it isn't necessary to repeat it every time. But when I have many different arrays with the same number of elements and the same data type, do I have to do it every time? If the answer is yes, using the CUB library becomes pointless.

  size_t temp_storage_bytes = 0;
  void* temp_storage = NULL;
  // First call, made with temp_storage == NULL (excluded from the timing)
  cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes, in, out, N, cub::Sum());
  cudaMalloc(&temp_storage, temp_storage_bytes);

  cudaDeviceSynchronize();
  cudaCheckError();
  cudaEventRecord(start);

  // Timed region: REPEAT reductions over the same input
  for (int i = 0; i < REPEAT; i++) {
    cub::DeviceReduce::Reduce(temp_storage, temp_storage_bytes, in, out, N, cub::Sum());
  }
  cudaEventRecord(stop);
  cudaDeviceSynchronize();

score 2 · Accepted Answer

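"I assume that this call has something to do with memory preparation, and that when we reduce the same data several times it isn't necessary to repeat it every time"

That's correct.

"But when I have many different arrays with the same number of elements and the same data type, do I have to do it every time?"

No, you don't need to do it every time. The sole purpose of the "first" call to cub::DeviceReduce::Reduce (i.e. the one made when temp_storage=NULL) is to report the number of bytes of temporary storage that CUB needs. If the type and size of your data do not change, there is no need to re-run either this step or the subsequent cudaMalloc operation. You can simply call cub::DeviceReduce::Reduce again on your "new" data, with temp_storage pointing to the allocation previously provided by cudaMalloc, as long as the size and type of the data are the same.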
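To make the reuse pattern concrete, here is a minimal sketch, not a definitive implementation. It assumes several int arrays of identical length and uses cub::DeviceReduce::Sum instead of the Reduce(..., cub::Sum()) call from the question, because as far as I know the argument list of Reduce differs between CUB versions. The names N, NUM_ARRAYS and reduce_many_arrays are hypothetical, and error checking is left out for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

// Hypothetical sizes, chosen only for illustration.
constexpr int N = 1 << 20;     // elements per array
constexpr int NUM_ARRAYS = 4;  // number of distinct input arrays

// Sums NUM_ARRAYS device arrays using one shared temporary buffer.
// Returns true if every sum matches the expected value.
// CUDA error checking is omitted to keep the sketch short.
bool reduce_many_arrays() {
  int* d_in[NUM_ARRAYS];
  int* d_out = nullptr;
  cudaMalloc(&d_out, NUM_ARRAYS * sizeof(int));
  for (int a = 0; a < NUM_ARRAYS; ++a) {
    std::vector<int> h(N, a + 1);  // array a is filled with the value a + 1
    cudaMalloc(&d_in[a], N * sizeof(int));
    cudaMemcpy(d_in[a], h.data(), N * sizeof(int), cudaMemcpyHostToDevice);
  }

  // Size query, done once: with a null pointer CUB only fills in temp_bytes.
  void* d_temp = nullptr;
  size_t temp_bytes = 0;
  cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in[0], d_out, N);
  cudaMalloc(&d_temp, temp_bytes);

  // The same buffer serves every array of the same type and size.
  for (int a = 0; a < NUM_ARRAYS; ++a) {
    cub::DeviceReduce::Sum(d_temp, temp_bytes, d_in[a], d_out + a, N);
  }

  int h_out[NUM_ARRAYS];
  cudaMemcpy(h_out, d_out, sizeof(h_out), cudaMemcpyDeviceToHost);

  bool ok = true;
  for (int a = 0; a < NUM_ARRAYS; ++a) {
    ok = ok && (h_out[a] == N * (a + 1));
    cudaFree(d_in[a]);
  }
  cudaFree(d_temp);
  cudaFree(d_out);
  return ok;
}

int main() {
  bool ok = reduce_many_arrays();
  std::printf(ok ? "all sums match\n" : "mismatch\n");
  return ok ? 0 : 1;
}
```

All reductions here are issued on the default stream, so they run one after another and can safely share the buffer. If you launched them concurrently on separate streams, each in-flight reduction would presumably need its own temporary storage.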
