cuda - CUDA - atomicAdd only adds up to 16777216

Question

I ran into the following easily reproducible problem when running the kernel below, which does nothing but atomicAdds on a float:

#include <cstdio>

#define OUT_ITERATIONS 20000000
#define BLOCKS 12
#define THREADS 192

__global__ void testKernel(float* result) {
    int i = threadIdx.x + blockIdx.x * blockDim.x;
    float bias = 1.0f;
    int n = 1;

    while (i < OUT_ITERATIONS) {
        atomicAdd(result, bias);
        i += BLOCKS * THREADS;
    }
}

The kernel should increment result OUT_ITERATIONS times, i.e. 20M times. I call the kernel with this standard code:

int main() {
cudaError_t cudaStatus;
float* result;
float* dev_result;

// Choose which GPU to run on, change this on a multi-GPU system.
cudaStatus = cudaSetDevice(0);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaSetDevice failed!  Do you have a CUDA-capable GPU installed?");
    goto Error;
}

result = new float;
cudaStatus = cudaMalloc((void**)&dev_result, sizeof(float));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMalloc failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}
cudaStatus = cudaMemset(dev_result, 0, sizeof(float));
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemset failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}

// Launch a kernel on the GPU with one thread for each element.
testKernel<<<BLOCKS, THREADS>>>(dev_result);

// Check for any errors launching the kernel
cudaStatus = cudaGetLastError();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "addKernel launch failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}

// cudaDeviceSynchronize waits for the kernel to finish, and returns
// any errors encountered during the launch.
cudaStatus = cudaDeviceSynchronize();
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaDeviceSynchronize returned error code %d after launching addKernel!\n", cudaStatus);
    goto Error;
}

cudaStatus = cudaMemcpy(result, dev_result, sizeof(float), cudaMemcpyDeviceToHost);
if (cudaStatus != cudaSuccess) {
    fprintf(stderr, "cudaMemcpy failed: %s\n", cudaGetErrorString(cudaStatus));
    goto Error;
}

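printf("Result: %f\n", *result);

Error:
// Cleanup is omitted for brevity; the process exits right after this point.
return (cudaStatus == cudaSuccess) ? 0 : 1;
}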

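However, the result printed at the end is 16777216.0, which incidentally is 0x1000000 in hex. The problem does not occur if OUT_ITERATIONS < 16777216: if I change it to 16777000, for example, the output is 16777000.0 as expected!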

System: NVidia Titan, CUDA 5.5, Windows 7

score 7 · Accepted Answer

This issue is caused by the limited precision of the type float.

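float has only 24 bits of binary precision. If you add two numbers and one of them is about 2^24 times larger than the other or more, the smaller one falls below the last significand bit of the larger one, and the sum usually rounds back to the larger number unchanged.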
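The 24-bit limit can be checked directly on the host. The following is a minimal sketch, assuming float is an IEEE 754 binary32 type (true on the usual CUDA host platforms); the helper names float_significand_bits and gap_above are made up for illustration.

```cpp
#include <cmath>
#include <limits>

// Number of binary digits in the float significand
// (24 for IEEE 754 binary32, counting the implicit leading bit).
int float_significand_bits() {
    return std::numeric_limits<float>::digits;
}

// Distance from x to the next representable float above it.
float gap_above(float x) {
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}
```

Under that assumption, gap_above(8388608.0f) (2^23) is 1.0f, while gap_above(16777216.0f) (2^24) is 2.0f, so from 2^24 onward adjacent floats are at least 2 apart and adding 1.0f can no longer reach a new value.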

When you add a large number like 16777216.0 (= 2^24) and a small number like 1.0, you lose precision, and the result is still 16777216.0. The same thing happens in a standard C program:

float a=16777216.0f;
float b=1.0f;
printf("%f\n",a+b);

You can replace float with double or int to solve this problem.
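To see the effect of the accumulator type without a GPU, the kernel's repeated additions can be imitated on the host. This is only a sketch of the arithmetic, not of atomicAdd itself; count_in_float and count_in_double are hypothetical helper names, and volatile is used so that every partial sum is rounded to the declared type even on targets that compute in wider registers.

```cpp
#include <cstdint>

// Adds 1.0f to a float accumulator `iterations` times, as the kernel's
// atomicAdd calls collectively do to *result.
float count_in_float(std::uint32_t iterations) {
    volatile float sum = 0.0f;
    for (std::uint32_t i = 0; i < iterations; ++i) {
        sum = sum + 1.0f;  // stops changing once sum reaches 2^24
    }
    return sum;
}

// Same loop with a double accumulator (53-bit significand).
double count_in_double(std::uint32_t iterations) {
    volatile double sum = 0.0;
    for (std::uint32_t i = 0; i < iterations; ++i) {
        sum = sum + 1.0;
    }
    return sum;
}
```

With IEEE 754 arithmetic, count_in_float(20000000) returns 16777216.0f, matching the kernel's output, while count_in_double(20000000) returns 20000000.0. An integer accumulator such as unsigned int would likewise count to 20M exactly.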

For an implementation of the double version, see the atomicAdd() section of the CUDA documentation:

http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions

score 4 · Accepted Answer

20M does not fit within the integer precision available in a float.

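A float quantity does not have 32 bits of mantissa (you discovered how many mantissa bits there are with your observation "which incidentally is 0x1000000 in hex"), so it cannot represent all integers the way an int or unsigned int can.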

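16777216 is the largest integer that can be reliably stored in a float: every integer up to it is exactly representable, but 16777217 is not.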
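One way to confirm this limit is to search for the first integer that does not survive a round trip through float. The sketch below assumes an IEEE 754 binary32 float with the default round-to-nearest mode; the function name is hypothetical.

```cpp
#include <cstdint>

// Returns the smallest positive integer that changes when converted to
// float and back, or 0 if none is found in the 32-bit range.
std::uint32_t first_integer_not_exact_in_float() {
    for (std::uint32_t i = 1; i != 0; ++i) {
        // volatile forces the value to be rounded to float precision.
        volatile float as_float = static_cast<float>(i);
        if (static_cast<std::uint32_t>(as_float) != i) {
            return i;
        }
    }
    return 0;
}
```

Under those assumptions the function returns 16777217 (2^24 + 1), which rounds to 16777216.0f when stored in a float.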

Either confine your storage range to what fits in a float, or use some other representation such as unsigned int or double if you want to reliably store 20M as an integer.

This isn't really a CUDA issue. You would run into similar difficulty trying to store large integers in a float in host code.
