cuda - Reduction in CUDA

Question

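I have just started learning CUDA programming, and I am confused about reduction.

I know that global memory has much higher access latency than shared memory, but can I use global memory to (at least) simulate behavior similar to shared memory?

For example, I want to sum the elements of a large array whose length is exactly BLOCK_SIZE * THREAD_SIZE (the grid and block dimensions are both powers of 2), and I tried the following code: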

    __global__ void parallelSum(unsigned int* array) {

        unsigned int totalThreadsNum = gridDim.x * blockDim.x;
        unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;

        int i = totalThreadsNum / 2;
        while (i != 0) {
            if (idx < i) {
                array[idx] += array[idx + i];
            }
            __syncthreads();
            i /= 2;
        }
    }

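I compared the result of this code with a result computed serially on the host, and the strange thing is that sometimes the results match and sometimes they are clearly different. Is there a reason for this related to using global memory here?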

score 5 · Accepted Answer

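Tom has already answered this question. In his answer, he suggests using Thrust or CUB to perform reductions in CUDA.

Here I provide a fully worked example of how to use both libraries to perform a reduction.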

#define CUB_STDERR

#include <stdio.h>

#include <thrust/device_ptr.h>
#include <thrust/reduce.h>
#include <thrust/execution_policy.h>

#include <cub/device/device_reduce.cuh>

#include "TimingGPU.cuh"
#include "Utilities.cuh"

using namespace cub;

/********/
/* MAIN */
/********/
int main() {

    const int N = 8388608;

    gpuErrchk(cudaFree(0));

    float *h_data       = (float *)malloc(N * sizeof(float));
    // Accumulate the host reference in double: a float running sum is no longer exact above 2^24.
    double h_result = 0.0;

    for (int i=0; i<N; i++) {
        h_data[i] = 3.f;
        h_result = h_result + h_data[i];
    }

    TimingGPU timerGPU;

    float *d_data;          gpuErrchk(cudaMalloc((void**)&d_data, N * sizeof(float)));
    gpuErrchk(cudaMemcpy(d_data, h_data, N * sizeof(float), cudaMemcpyHostToDevice));

    /**********/
    /* THRUST */
    /**********/
    timerGPU.StartCounter();
    thrust::device_ptr<float> wrapped_ptr = thrust::device_pointer_cast(d_data);
    float h_result1 = thrust::reduce(wrapped_ptr, wrapped_ptr + N);
    printf("Timing for Thrust = %f\n", timerGPU.GetCounter());

    /*******/
    /* CUB */
    /*******/
    timerGPU.StartCounter();
    float           *h_result2 = (float *)malloc(sizeof(float));
    float           *d_result2; gpuErrchk(cudaMalloc((void**)&d_result2, sizeof(float)));
    void            *d_temp_storage = NULL;
    size_t          temp_storage_bytes = 0;

    DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_data, d_result2, N);
    gpuErrchk(cudaMalloc((void**)&d_temp_storage, temp_storage_bytes));
    DeviceReduce::Sum(d_temp_storage, temp_storage_bytes, d_data, d_result2, N);

    gpuErrchk(cudaMemcpy(h_result2, d_result2, sizeof(float), cudaMemcpyDeviceToHost));

    printf("Timing for CUB = %f\n", timerGPU.GetCounter());

    printf("Results:\n");
    printf("Exact: %f\n", h_result);
    printf("Thrust: %f\n", h_result1);
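    printf("CUB: %f\n", h_result2[0]);

    // Release device and host buffers.
    gpuErrchk(cudaFree(d_temp_storage));
    gpuErrchk(cudaFree(d_result2));
    gpuErrchk(cudaFree(d_data));
    free(h_result2);
    free(h_data);

    return 0;
}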

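Note that, owing to their different underlying philosophies, CUB can be somewhat faster than Thrust: CUB leaves performance-critical details, such as the exact choice of algorithm and the degree of concurrency, unbound and in the hands of the user. These parameters can then be tuned to maximize performance for a particular architecture and application.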
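As one illustration of that kind of tuning, the sketch below selects the block-level reduction algorithm through the `ALGORITHM` template parameter of `cub::BlockReduce`. The kernel name `blockReduceSum`, the host helper `sumWithAlgorithm`, and the block size of 128 are assumptions made for this example and are not part of the code above; error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>
#include <cub/cub.cuh>

constexpr int kBlockThreads = 128;  // assumed block size for this sketch

// Each block reduces its slice with the CUB algorithm chosen at compile time,
// then thread 0 adds the block total to the global result.
template <cub::BlockReduceAlgorithm ALGORITHM>
__global__ void blockReduceSum(const int* in, int* out, int n) {
    using BlockReduce = cub::BlockReduce<int, kBlockThreads, ALGORITHM>;
    __shared__ typename BlockReduce::TempStorage temp_storage;

    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int value = (idx < n) ? in[idx] : 0;  // out-of-range threads contribute 0

    // Collective call: every thread in the block participates.
    // The returned total is only valid in thread 0.
    int block_total = BlockReduce(temp_storage).Sum(value);
    if (threadIdx.x == 0) {
        atomicAdd(out, block_total);
    }
}

// Hypothetical host helper that runs the kernel with the chosen algorithm.
template <cub::BlockReduceAlgorithm ALGORITHM>
int sumWithAlgorithm(const std::vector<int>& h_in) {
    const int n = static_cast<int>(h_in.size());
    int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void**)&d_in, n * sizeof(int));
    cudaMalloc((void**)&d_out, sizeof(int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(d_out, 0, sizeof(int));

    const int blocks = (n + kBlockThreads - 1) / kBlockThreads;
    blockReduceSum<ALGORITHM><<<blocks, kBlockThreads>>>(d_in, d_out, n);

    int result = 0;
    cudaMemcpy(&result, d_out, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main() {
    std::vector<int> h_in(1 << 20, 3);
    printf("raking:          %d\n", sumWithAlgorithm<cub::BLOCK_REDUCE_RAKING>(h_in));
    printf("warp reductions: %d\n", sumWithAlgorithm<cub::BLOCK_REDUCE_WARP_REDUCTIONS>(h_in));
    return 0;
}
```

Both variants return the same integer sum; which one is faster depends on the GPU and the block size, so it is worth timing them on the target hardware.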

"CUB in Action, some simple examples using the CUB template library" reports a comparison for computing the Euclidean norm of an array.
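For orientation, a Euclidean norm is just a reduction over squared elements followed by a square root. The following is a minimal Thrust sketch of that computation, not the code from the referenced comparison; the functor `Square` and the helper `euclideanNorm` are names invented here.

```cuda
#include <cmath>
#include <cstdio>
#include <thrust/device_vector.h>
#include <thrust/functional.h>
#include <thrust/transform_reduce.h>

// Unary functor applied to each element before the sum.
struct Square {
    __host__ __device__ float operator()(float x) const { return x * x; }
};

// Hypothetical helper: sqrt of the sum of squares, computed on the device.
float euclideanNorm(const thrust::device_vector<float>& v) {
    float sum_sq = thrust::transform_reduce(v.begin(), v.end(),
                                            Square(), 0.0f,
                                            thrust::plus<float>());
    return std::sqrt(sum_sq);
}

int main() {
    thrust::device_vector<float> v(4);
    v[0] = 1.0f; v[1] = 2.0f; v[2] = 2.0f; v[3] = 4.0f;
    // 1 + 4 + 4 + 16 = 25, so the norm is 5.
    printf("norm = %f\n", euclideanNorm(v));
    return 0;
}
```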

score 4 · Accepted Answer

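The best thing to do is to start with the reduction example in the CUDA Samples. The scan example is also useful for learning the principles of parallel computing on a throughput architecture.

That said, if you actually just want to use a reduction operator in your code, you should look at Thrust (called from the host, cross-platform) and CUB (specific to CUDA GPUs).

Looking at your specific questions:

There is no reason you cannot use global memory for a reduction. The sample code in the toolkit walks through several levels of optimization, but in every case the data starts in global memory.
Your code is inefficient (see the example in the toolkit for more details on work efficiency!).
Your code attempts to communicate between threads in different blocks without proper synchronization; __syncthreads() only synchronizes threads within a given block, not across different blocks (that is impossible, at least in general, because you tend to oversubscribe the GPU, meaning that not all blocks will be running at any given time).

The last point is the most important. If a thread in block X wants to read data written by block Y, you need to split the work across two kernel launches, which is why a typical parallel reduction takes a multi-phase approach: reduce batches within blocks, then reduce across the batches.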
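To make the synchronization point concrete, here is a minimal sketch that keeps the question's global-memory approach but moves the halving loop to the host, so that each step is its own kernel launch. Launches issued to the same stream run in order, which gives the grid-wide barrier that __syncthreads() cannot. The names `reduceStep` and `sumWithGlobalMemory` and the block size of 256 are assumptions for illustration; the array length is assumed to be a power of two (at least 1), as in the question, and error checking is omitted.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// One halving step: each of the first `half` elements absorbs its partner
// from the upper half. Within a single launch no thread reads a location
// that another thread writes, so no in-kernel synchronization is needed.
__global__ void reduceStep(unsigned int* array, unsigned int half) {
    unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;
    if (idx < half) {
        array[idx] += array[idx + half];
    }
}

// Hypothetical host helper: sums h_in, whose size must be a power of two.
unsigned int sumWithGlobalMemory(const std::vector<unsigned int>& h_in) {
    const unsigned int n = static_cast<unsigned int>(h_in.size());
    unsigned int* d_array = nullptr;
    cudaMalloc((void**)&d_array, n * sizeof(unsigned int));
    cudaMemcpy(d_array, h_in.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    const unsigned int threads = 256;  // assumed block size
    for (unsigned int half = n / 2; half != 0; half /= 2) {
        unsigned int blocks = (half + threads - 1) / threads;
        // Each launch starts only after the previous one has finished.
        reduceStep<<<blocks, threads>>>(d_array, half);
    }

    unsigned int result = 0;
    cudaMemcpy(&result, d_array, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_array);
    return result;
}

int main() {
    std::vector<unsigned int> h_in(1 << 20, 1u);
    printf("sum = %u\n", sumWithGlobalMemory(h_in));  // 1048576 ones
    return 0;
}
```

This version is correct but still not work-efficient: it issues one launch per halving step and touches global memory at every step.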
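A minimal sketch of that multi-phase scheme follows, assuming a block size of 256 and a non-empty input. Each launch reduces batches inside blocks using shared memory, and the per-block partial sums become the input of the next launch until a single value remains. The names `blockSum` and `sumMultiPhase` are invented for this example; the reduction sample in the toolkit is considerably more optimized, and error checking is omitted here.

```cuda
#include <cstdio>
#include <utility>
#include <vector>
#include <cuda_runtime.h>

constexpr unsigned int kThreads = 256;  // assumed block size; must be a power of two

// Phase kernel: each block reduces up to kThreads inputs to one partial sum.
__global__ void blockSum(const unsigned int* in, unsigned int* out, unsigned int n) {
    __shared__ unsigned int tile[kThreads];
    unsigned int idx = blockDim.x * blockIdx.x + threadIdx.x;
    tile[threadIdx.x] = (idx < n) ? in[idx] : 0u;  // pad the tail with zeros
    __syncthreads();

    for (unsigned int stride = blockDim.x / 2; stride != 0; stride /= 2) {
        if (threadIdx.x < stride) {
            tile[threadIdx.x] += tile[threadIdx.x + stride];
        }
        __syncthreads();  // outside the if, so every thread in the block reaches it
    }
    if (threadIdx.x == 0) {
        out[blockIdx.x] = tile[0];
    }
}

// Hypothetical host helper: relaunches blockSum until one value remains.
unsigned int sumMultiPhase(const std::vector<unsigned int>& h_in) {
    unsigned int n = static_cast<unsigned int>(h_in.size());  // assumes n >= 1
    unsigned int blocks = (n + kThreads - 1) / kThreads;
    unsigned int *d_in = nullptr, *d_out = nullptr;
    cudaMalloc((void**)&d_in, n * sizeof(unsigned int));
    cudaMalloc((void**)&d_out, blocks * sizeof(unsigned int));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(unsigned int), cudaMemcpyHostToDevice);

    while (true) {
        blocks = (n + kThreads - 1) / kThreads;
        blockSum<<<blocks, kThreads>>>(d_in, d_out, n);
        if (blocks == 1) break;   // d_out[0] now holds the total
        std::swap(d_in, d_out);   // partial sums feed the next phase
        n = blocks;
    }

    unsigned int result = 0;
    cudaMemcpy(&result, d_out, sizeof(unsigned int), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return result;
}

int main() {
    std::vector<unsigned int> h_in(1 << 20, 1u);
    printf("sum = %u\n", sumMultiPhase(h_in));  // 1048576 ones
    return 0;
}
```

The kernel boundary is what makes block X's partial sum safely visible to whichever block consumes it in the next phase.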
