cuda - CUDA dot product

Question

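I am trying to implement a classic dot-product kernel for double-precision arrays, with the final sum across the blocks computed atomically. I use atomicAdd for double precision as described on page 116 of the programming guide. I am probably doing something wrong. The partial sums of the threads within each block are computed correctly, but the atomic operation does not seem to work properly, because every time I run my kernel with the same data I get a different result. I would be grateful if somebody could spot the mistake or suggest an alternative solution! Here is my kernel: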

__global__ void cuda_dot_kernel(int *n,double *a, double *b, double *dot_res)
{
    __shared__ double cache[threadsPerBlock]; //thread shared memory
    int global_tid=threadIdx.x + blockIdx.x * blockDim.x;
    int i=0,cacheIndex=0;
    double temp = 0;
    cacheIndex = threadIdx.x;
    while (global_tid < (*n)) {
        temp += a[global_tid] * b[global_tid];
        global_tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;
    __syncthreads();
    for (i=blockDim.x/2; i>0; i>>=1) {
        if (threadIdx.x < i) {
            cache[threadIdx.x] += cache[threadIdx.x + i];
        }
        __syncthreads();
    }
    __syncthreads();
    if (cacheIndex==0) {
        *dot_res=cuda_atomicAdd(dot_res,cache[0]);
    }
}

And here is my device function atomicAdd:

__device__ double cuda_atomicAdd(double *address, double val)
{
    double assumed,old=*address;
    do {
        assumed=old;
        old= __longlong_as_double(atomicCAS((unsigned long long int*)address,
                    __double_as_longlong(assumed),
                    __double_as_longlong(val+assumed)));
    }while (assumed!=old);

    return old;
}

score 9 · Accepted Answer

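Getting a reduction right with ad hoc CUDA code can be tricky, so here is an alternative solution using a Thrust algorithm, which is included with the CUDA toolkit: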

#include <thrust/inner_product.h>
#include <thrust/device_ptr.h>

double do_dot_product(int n, double *a, double *b)
{
  // wrap raw pointers to device memory with device_ptr
  thrust::device_ptr<double> d_a(a), d_b(b);

  // inner_product implements a mathematical dot product
  return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}
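As a usage illustration for the wrapper above, here is a minimal sketch of a complete program. The helper name `dot_of_constants`, the vector length and the fill values are assumptions made for this example; `thrust::device_vector` and `thrust::raw_pointer_cast` are used only to obtain raw device pointers of the kind `do_dot_product` expects.

```cuda
#include <cstdio>
#include <thrust/device_ptr.h>
#include <thrust/device_vector.h>
#include <thrust/inner_product.h>

// Same wrapper as in the answer: raw device pointers in, dot product out.
double do_dot_product(int n, double *a, double *b)
{
  thrust::device_ptr<double> d_a(a), d_b(b);
  return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}

// Hypothetical helper for this example: dot product of two constant vectors.
double dot_of_constants(int n, double x, double y)
{
  // device_vector allocates on the GPU and fills every element with the value
  thrust::device_vector<double> a(n, x), b(n, y);
  return do_dot_product(n,
                        thrust::raw_pointer_cast(a.data()),
                        thrust::raw_pointer_cast(b.data()));
}

int main()
{
  // 1000 elements of 2.0 * 3.0, so the sum should be 6000
  printf("dot = %f\n", dot_of_constants(1000, 2.0, 3.0));
  return 0;
}
```

If the data already lives in a `thrust::device_vector`, the iterators `a.begin()`, `a.end()` and `b.begin()` can be passed to `thrust::inner_product` directly, without the raw-pointer round trip.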

score 4 · Accepted Answer

You are using the cuda_atomicAdd function incorrectly. This section of your kernel:

if (cacheIndex==0) {
    *dot_res=cuda_atomicAdd(dot_res,cache[0]);
}

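is the culprit. Here, you atomically add to dot_res and then non-atomically overwrite dot_res with the value the function returns. The return value of this function is the previous value of the location being atomically updated; it is supplied only for "information" or for the caller's local use. You must not assign it to the very location you are updating atomically, because that completely defeats the purpose of using atomic memory access in the first place. Do something like this instead: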

if (cacheIndex==0) {
    double result=cuda_atomicAdd(dot_res,cache[0]);
}
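To put the fix in context, here is a minimal sketch of the whole flow. It assumes a block size of 256, a grid of 32 blocks and the names `atomic_add_double`, `dot_kernel` and `run_dot`, none of which come from the question. The compare-and-swap loop compares the 64-bit integer patterns rather than the doubles, which is the form given in the programming guide and avoids an endless loop if the stored value is NaN. The accumulator is zeroed before the launch, since the kernel only ever adds to it. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // assumed block size, must be a power of two

// CAS-based atomic add for doubles; returns the previous value at address.
__device__ double atomic_add_double(double *address, double val)
{
    unsigned long long int *addr = (unsigned long long int *)address;
    unsigned long long int old = *addr, assumed;
    do {
        assumed = old;
        old = atomicCAS(addr, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
    } while (assumed != old);  // integer comparison, safe even for NaN
    return __longlong_as_double(old);
}

__global__ void dot_kernel(int n, const double *a, const double *b, double *dot_res)
{
    __shared__ double cache[THREADS_PER_BLOCK];
    double temp = 0.0;
    // grid-stride loop: each thread accumulates its share of the products
    for (int i = threadIdx.x + blockIdx.x * blockDim.x; i < n; i += blockDim.x * gridDim.x)
        temp += a[i] * b[i];
    cache[threadIdx.x] = temp;
    __syncthreads();
    // tree reduction within the block
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }
    // one atomic update per block; the returned old value is deliberately ignored
    if (threadIdx.x == 0)
        atomic_add_double(dot_res, cache[0]);
}

// Hypothetical host driver: copies the inputs, zeroes the result, runs the kernel.
double run_dot(const std::vector<double> &a, const std::vector<double> &b)
{
    const int n = (int)a.size();
    const size_t bytes = n * sizeof(double);
    double *d_a, *d_b, *d_res, h_res = 0.0;
    cudaMalloc(&d_a, bytes);
    cudaMalloc(&d_b, bytes);
    cudaMalloc(&d_res, sizeof(double));
    cudaMemcpy(d_a, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemset(d_res, 0, sizeof(double));  // all-zero bits represent 0.0
    dot_kernel<<<32, THREADS_PER_BLOCK>>>(n, d_a, d_b, d_res);
    cudaMemcpy(&h_res, d_res, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_res);
    return h_res;
}

int main()
{
    std::vector<double> a(100000, 1.0), b(100000, 2.0);
    printf("dot = %f\n", run_dot(a, b));  // 100000 products of 2.0
    return 0;
}
```

On devices of compute capability 6.0 and higher, the built-in `atomicAdd` accepts `double` directly, so the hand-written loop is only needed on older hardware. With arbitrary input data the order in which the blocks reach the atomic update is not fixed, so the last bits of the result can still vary slightly between runs because floating-point addition is not associative.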

score -1 · Accepted Answer

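I have not checked your code in depth, but here is some advice.
I would only recommend Thrust if you use your GPU solely for generic tasks like this one, because when a more complex problem comes up, people who rely on it have no idea how to write efficient parallel code for the GPU.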

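Start a new parallel reduction kernel to sum up the dot product.
Since the data is already on the device, you will not see a performance penalty from launching a new kernel.
Your kernel does not seem to scale across the maximum possible number of blocks on the newest GPUs. If it did, and it had to compute the dot product of millions of values, performance would drop dramatically because of the serialized atomic operation.
Beginner mistake: are your input data and your shared memory accesses range-checked? Or are you sure the input length is always a multiple of your block size? Otherwise you will read garbage. Most of my wrong results were caused by this mistake.
Optimize your parallel reduction. See my thesis or the optimizations by Mark Harris.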

Untested, I just wrote this down in Notepad:

/*
 * @param inCount_s unsigned long long int Length of both input arrays
 * @param inValuesA_g double* First value array
 * @param inValuesB_g double* Second value array
 * @param outDots_g double* Output dots of each block, length equals the number of blocks
 */
__global__ void dotProduct(const unsigned long long int inCount_s,
    const double* inValuesA_g,
    const double* inValuesB_g,
    double* outDots_g)
{
    //get unique block index in a possible 3D Grid
    const unsigned long long int blockId = blockIdx.x //1D
            + blockIdx.y * gridDim.x //2D
            + gridDim.x * gridDim.y * blockIdx.z; //3D


    //block dimension uses only x-coordinate
    const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;

    /*
     * shared value pair products array, where BLOCK_SIZE power of 2
     *
     * To improve performance, increase its size by a multiple of BLOCK_SIZE, so that each thread loads more than 1 element!
     * (outDots_g length decreases by the same factor, and you need to range check and initialize memory)
     * -> see Harris GPU optimisations / parallel reduction slides for more information.
     */
    __shared__ double dots_s[BLOCK_SIZE];


    /*
     * initialize shared memory array and calculate dot product of two values,
     * shared memory always needs to be initialized, it is never 0 by default, else garbage is read later!
     */
    if(tId < inCount_s)
        dots_s[threadIdx.x] = inValuesA_g[tId] * inValuesB_g[tId];
    else
        dots_s[threadIdx.x] = 0;
    __syncthreads();

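    //do parallel reduction on shared memory array to sum up values
    //(inline tree reduction in place of the reductionAdd helper, which is not shown here)
    for(unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if(threadIdx.x < stride)
            dots_s[threadIdx.x] += dots_s[threadIdx.x + stride];
        __syncthreads();
    }

    //output value, one partial sum per block
    if(threadIdx.x == 0)
        outDots_g[blockId] = dots_s[0];

    //start new parallel reduction kernel to sum up outDots_g!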
}
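The second pass that the last comment in the kernel asks for could be sketched as follows. This is a minimal two-kernel version under stated assumptions: a 1D grid, `BLOCK_SIZE` fixed at 256, a nonempty input, and the names `block_reduce`, `dot_partial`, `sum_partials` and `two_pass_dot`, all of which are invented for this example rather than taken from the answer or the thesis it mentions. The first kernel writes one partial sum per block; the second uses a single block to add the partials, so no atomic operation is needed. Error checking is omitted for brevity.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

#define BLOCK_SIZE 256  // assumed; must be a power of two and match the launch

// Tree reduction over a shared array; every thread of the block must call it.
__device__ void block_reduce(double *s)
{
    for (unsigned int stride = blockDim.x / 2; stride > 0; stride >>= 1) {
        if (threadIdx.x < stride)
            s[threadIdx.x] += s[threadIdx.x + stride];
        __syncthreads();
    }
}

// Pass 1: each block writes the dot product of its slice to partial[blockIdx.x].
__global__ void dot_partial(size_t n, const double *a, const double *b, double *partial)
{
    __shared__ double s[BLOCK_SIZE];
    const size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    s[threadIdx.x] = (i < n) ? a[i] * b[i] : 0.0;  // range check pads with zeros
    __syncthreads();
    block_reduce(s);
    if (threadIdx.x == 0)
        partial[blockIdx.x] = s[0];
}

// Pass 2: a single block sums all partials using a block-stride loop.
__global__ void sum_partials(size_t count, const double *partial, double *result)
{
    __shared__ double s[BLOCK_SIZE];
    double acc = 0.0;
    for (size_t i = threadIdx.x; i < count; i += blockDim.x)
        acc += partial[i];
    s[threadIdx.x] = acc;
    __syncthreads();
    block_reduce(s);
    if (threadIdx.x == 0)
        *result = s[0];
}

// Hypothetical host driver; assumes a and b have the same nonzero length.
double two_pass_dot(const std::vector<double> &a, const std::vector<double> &b)
{
    const size_t n = a.size();
    const unsigned int blocks = (unsigned int)((n + BLOCK_SIZE - 1) / BLOCK_SIZE);
    double *d_a, *d_b, *d_partial, *d_result, h_result = 0.0;
    cudaMalloc(&d_a, n * sizeof(double));
    cudaMalloc(&d_b, n * sizeof(double));
    cudaMalloc(&d_partial, blocks * sizeof(double));
    cudaMalloc(&d_result, sizeof(double));
    cudaMemcpy(d_a, a.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    cudaMemcpy(d_b, b.data(), n * sizeof(double), cudaMemcpyHostToDevice);
    dot_partial<<<blocks, BLOCK_SIZE>>>(n, d_a, d_b, d_partial);
    sum_partials<<<1, BLOCK_SIZE>>>(blocks, d_partial, d_result);
    cudaMemcpy(&h_result, d_result, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_a);
    cudaFree(d_b);
    cudaFree(d_partial);
    cudaFree(d_result);
    return h_result;
}

int main()
{
    // 1000 is deliberately not a multiple of BLOCK_SIZE to exercise the range check
    std::vector<double> a(1000, 1.5), b(1000, 2.0);
    printf("dot = %f\n", two_pass_dot(a, b));  // 1000 products of 3.0
    return 0;
}
```

Because the summation order is fixed by the launch configuration, this version returns the same bits on every run for the same input, which an atomic accumulation across blocks does not guarantee.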

Edit: removed unnecessary points.
