cuda - Performance of thrust::count

Question

I have the following code as part of a data reorganization step, for later use in a CUDA kernel:

thrust::device_ptr<int> dev_ptr = thrust::device_pointer_cast(dev_particle_cell_indices);
int total = 0;
for(int i = 0; i < num_cells; i++) {
    particle_offsets[i] = total;
    // int num = 0;
    int num = thrust::count(dev_ptr, dev_ptr + num_particles, i);
    particle_counts[i] = num;
    total += num;
}

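Now, if I set num to 0 (uncommenting line 5 and commenting out line 6), the application runs at more than 30 fps, which is my goal. However, when I set num equal to the result of the thrust::count call, the frame rate drops to about 1-2 fps. Why does this happen?

My understanding is that Thrust is supposed to be a collection of highly optimized algorithms that harness the power of the GPU, so I am surprised it has this effect on my program's performance. This is my first time using Thrust, so I may be unaware of some important details.

Is there something about using thrust::count in a loop that makes it run so slowly? How can I optimize my usage of it?

To give some figures, in my current test case num_particles is about 2000 and num_cells is about 1500.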

score 7 · Accepted Answer

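The performance of thrust::count is fine; it is the way you are trying to use it that is problematic for performance. If you had a lot of particles and only a few cells, then your implementation using thrust::count would probably not be a bad idea. Your problem is that you have 1500 cells. That means 1500 invocations of count and 1500 device-to-host memory transfers each time you want to do the computation. As you have found, the latency of all those kernel launches and all those copies over the PCI-e bus kills performance.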
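To make that cost concrete, here is a rough host-side analogue of the loop in the question, written in plain C++ with std::count rather than Thrust. The function name count_per_cell_naive and the elements_scanned counter are hypothetical, introduced only for illustration. It does not model launch or PCI-e latency; it only shows that every cell triggers a separate full scan, so the work grows as num_cells times num_particles.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <vector>

// Host-side analogue of the per-cell counting loop (illustrative only).
// Returns the count for each cell and reports the total number of elements
// examined through `elements_scanned`.
std::vector<int> count_per_cell_naive(const std::vector<int>& cell_indices,
                                      int num_cells,
                                      std::size_t& elements_scanned) {
    assert(num_cells >= 0);
    std::vector<int> counts(num_cells, 0);
    elements_scanned = 0;
    for (int cell = 0; cell < num_cells; ++cell) {
        // On the GPU, this line corresponds to one thrust::count call:
        // a separate launch plus a copy of the scalar result to the host.
        counts[cell] = static_cast<int>(
            std::count(cell_indices.begin(), cell_indices.end(), cell));
        elements_scanned += cell_indices.size();
    }
    return counts;
}
```

With the figures from the question (about 2000 particles and 1500 cells), this pattern examines roughly 3,000,000 elements and, on the device, pays the per-call latency 1500 times.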

A better approach for a large number of cells would be something like this:

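// Headers needed by this snippet (place them at the top of the file)
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/copy.h>
#include <thrust/sort.h>
#include <thrust/binary_search.h>
#include <thrust/adjacent_difference.h>
#include <thrust/iterator/counting_iterator.h>

// Wrap the raw device pointer so Thrust algorithms can use it
thrust::device_ptr<int> rawin = thrust::device_pointer_cast(dev_particle_cell_indices);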

// Sort a scratch copy of the cell indices by value
thrust::device_vector<int> cidx(num_particles);
thrust::copy(rawin, rawin+num_particles, cidx.begin());
thrust::sort(cidx.begin(), cidx.end());

// Use binary search to extract all the cell counts/offsets
thrust::counting_iterator<int> cellnumber(0);
thrust::device_vector<int> offsets(num_cells), counts(num_cells);

// Offsets come from lower_bound of the ordered cell numbers
thrust::lower_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber+num_cells, offsets.begin());

// Counts come from the adjacent_difference of the upper_bound of the ordered cell numbers
thrust::upper_bound(cidx.begin(), cidx.end(), cellnumber, cellnumber+num_cells, counts.begin());
thrust::adjacent_difference(counts.begin(), counts.end(), counts.begin());

// Copy back to the host pointer
thrust::copy(counts.begin(), counts.end(), particle_counts);
thrust::copy(offsets.begin(), offsets.end(), particle_offsets);

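Here we first sort a local copy of the cell indices and then use the Thrust binary search functions to perform the same operation as your code, but with far fewer passes through the data in GPU memory and with only two device-to-host memory copies to get all the results back to the host.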
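For readers who want to check the logic without a GPU, the same sort-then-search idea can be sketched on the host in plain C++. This is only a reference under stated assumptions, not the Thrust code above: the CellTable struct and build_cell_table function are hypothetical names, and the count is taken directly as upper_bound minus lower_bound for each cell instead of via adjacent_difference over the upper bounds.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Hypothetical container for the two result arrays.
struct CellTable {
    std::vector<int> offsets;  // index of the first particle of each cell
    std::vector<int> counts;   // number of particles in each cell
};

// Takes the indices by value so the caller's array is left unsorted,
// mirroring the scratch copy used in the Thrust version.
CellTable build_cell_table(std::vector<int> cell_indices, int num_cells) {
    assert(num_cells >= 0);
    std::sort(cell_indices.begin(), cell_indices.end());

    CellTable table;
    table.offsets.resize(num_cells);
    table.counts.resize(num_cells);
    for (int cell = 0; cell < num_cells; ++cell) {
        // First position whose value is >= cell: the offset of this cell.
        auto lo = std::lower_bound(cell_indices.begin(), cell_indices.end(), cell);
        // First position whose value is > cell: one past this cell's run.
        auto hi = std::upper_bound(lo, cell_indices.end(), cell);
        table.offsets[cell] = static_cast<int>(lo - cell_indices.begin());
        table.counts[cell] = static_cast<int>(hi - lo);
    }
    return table;
}
```

For the indices {2, 0, 2, 1, 2} and four cells, the sorted copy is {0, 1, 2, 2, 2}, which gives offsets {0, 1, 2, 5} and counts {1, 1, 3, 0}, the same values the loop in the question would produce.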

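When I benchmarked your thrust::count implementation against the code posted above (10000 random particles and 2000 cells on a GeForce 320M with CUDA 4.1 on OS X), I found that your version takes about 0.95 seconds to run, whereas the sort/search version takes about 0.003 seconds. So there is probably a speed-up of several hundred times available with Thrust if you use a more efficient strategy and more appropriate algorithms.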
