sorting - 为什么连续推力排序和减少命令之间在 GPU 上没有活动？

Question

请参考下面显示我的 CUDA 代码的 Nvidia Visual Profiler 会话的两个快照：

nvprof 会话的快照显示了thrust::sort 和thrust::reduce 调用执行时间线

突出显示 sort 和 reduce 调用以显示所用时间和执行之间的差距

你可以看到两次调用之间有大约 70 us 的差距，那么第一次和第二次调用thrust::sort()之间就有很大的差距。快照中总共有大约 300 us 的此类间隙可见。我相信这些是“空闲”时间，也许是推力库引入的。无论如何，我找不到任何相关的讨论，或者 Nvidia 的文档。有人可以解释为什么我有如此明显的“空闲”时间吗？加起来，这样的时间占我应用程序执行时间的 40%，所以这对我来说是个大问题！thrust::reduce()thrust::sort()

此外，我测量到我编写的对连续 cuda 内核的调用之间的差距大约只有 3 us！

我编写了一个示例 cuda 代码以便在此处发布：

void profileThrustSortAndReduce(const int ARR_SIZE) {
    // for thrust::reduce on first 10% of the sorted array
    const int ARR_SIZE_BY_10 = ARR_SIZE / 10;

    // generate host random arrays of float
    float* h_arr1;          cudaMallocHost((void **)&h_arr1, ARR_SIZE * sizeof(float));
    float* h_arr2;          cudaMallocHost((void **)&h_arr2, ARR_SIZE * sizeof(float));
    for (int i = 0; i < ARR_SIZE; i++) {
        h_arr1[i] = static_cast <float> (rand()) / static_cast <float> (RAND_MAX)* 1000.0f;
        h_arr2[i] = static_cast <float> (rand()) / static_cast <float> (RAND_MAX)* 1000.0f;
    }

    // device arrays populated
    float* d_arr1;          cudaMalloc((void **)&d_arr1, ARR_SIZE * sizeof(float));
    float* d_arr2;          cudaMalloc((void **)&d_arr2, ARR_SIZE * sizeof(float));
    cudaMemcpy(d_arr1, h_arr1, ARR_SIZE * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemcpy(d_arr2, h_arr2, ARR_SIZE * sizeof(float), cudaMemcpyHostToDevice);

    // start cuda profiler
    cudaProfilerStart();

    // sort the two device arrays
    thrust::sort(thrust::device, d_arr1, d_arr1 + ARR_SIZE);
    thrust::sort(thrust::device, d_arr2, d_arr2 + ARR_SIZE);

    // mean of 100 percentiles of device array
    float arr1_red_100pc_mean = thrust::reduce(thrust::device, d_arr1, d_arr1 + ARR_SIZE) / ARR_SIZE;
    // mean of smallest 10 percentiles of device array
    float arr1_red_10pc_mean = thrust::reduce(thrust::device, d_arr1, d_arr1 + ARR_SIZE_BY_10) / ARR_SIZE_BY_10;

    // mean of 100 percentiles of device array
    float arr2_red_100pc_mean = thrust::reduce(thrust::device, d_arr2, d_arr2 + ARR_SIZE) / ARR_SIZE;
    // mean of smallest 10 percentiles of device array
    float arr2_red_10pc_mean = thrust::reduce(thrust::device, d_arr2, d_arr2 + ARR_SIZE_BY_10) / ARR_SIZE_BY_10;

    // stop cuda profiler
    cudaProfilerStop();
}

此示例函数的 nvprof 会话的快照

score 2 · Accepted Answer

这些差距主要是由cudaMalloc操作引起的。 thrust::sort并且大概thrust::reduce分配（和释放）与其活动相关的临时存储。

您已将时间线的这一部分从您粘贴到问题中的前 2 张图片中剪掉，但在您在第三张图片中显示的时间线部分的正上方，您会cudaMalloc在“运行时 API”分析器行中找到操作.

这些cudaMalloc（和cudaFree）操作非常耗时且需要同步。要解决这个问题，典型的建议是使用推力自定义分配器（也在此处）。这样做，您可以在程序开始时为所需的必要大小分配一次，而不必在每次进行推力调用时产生分配/释放开销。

或者，您可以探索cub，它已经为您分离了分配和处理步骤。

sorting - 为什么连续推力排序和减少命令之间在 GPU 上没有活动？

1 回答 1

Related

Reference