c++ - Allocating two arrays with a single call to cudaMalloc

Question

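Memory allocation is one of the most time-consuming operations on the GPU, so I wanted to allocate 2 arrays with a single call to cudaMalloc, using the following code: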

int numElements = 50000;
size_t size = numElements * sizeof(float);

//declarations-initializations
float *d_M = NULL;
err = cudaMalloc((void **)&d_M, 2*size);
//error checking

// Allocate the device input vector A
float *d_A = d_M;


// Allocate the device input vector B
float *d_B = d_M + size;

err = cudaMemcpy(d_A, h_A, size, cudaMemcpyHostToDevice);
//error checking

err = cudaMemcpy(d_B, h_B, size, cudaMemcpyHostToDevice);
//error checking

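The original code is the vectorAdd.cu sample from the CUDA Toolkit's samples folder, so you can assume that h_A and h_B are properly initialized and that the code works without my modifications.
The result was that the second cudaMemcpy returned an error with the message invalid argument.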

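It seems that the operation "d_M + size" does not return what one would expect, perhaps because device memory behaves differently, but I don't know how.

Is it possible to make my approach (allocating memory for two arrays with one call to cudaMalloc) work? Any comments or answers on whether this is a good approach are also welcome.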

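UPDATE
As the answers from Robert and dreamcrash suggested, I had to add the number of elements (numElements) to the pointer d_M, not the size in bytes. For reference, there was no observable speedup.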

score 4 · Accepted Answer

You just need to replace

float *d_B = d_M + size;

with

float *d_B = d_M + numElements;

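This is pointer arithmetic. If R points to a float array [1.0, 1.2, 3.3, 3.4], you can print its first element with printf("%f", *R);. What about the second element? You write printf("%f\n", *(++R));, that is, you advance the pointer by 1 (R + 1). You do not advance it by sizeof(float), as you were doing. R + sizeof(float) refers to the element at position R[4], because sizeof(float) = 4, and that is already past the end of this four-element array.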
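The scaling described above can be checked on the host with ordinary C++. The following is only a minimal sketch; element_at and byte_distance are hypothetical helper names introduced here for illustration, not part of CUDA or of the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: read element n of a float array of `count` elements
// using explicit pointer arithmetic. base + n advances by n elements,
// i.e. by n * sizeof(float) bytes; the compiler does the scaling.
inline float element_at(const float* base, std::size_t count, std::size_t n) {
    assert(n < count);   // stay inside the array
    return *(base + n);  // identical to base[n]
}

// Hypothetical helper: distance in bytes between base and base + n.
// base + n must stay within the array or one past its end.
inline std::ptrdiff_t byte_distance(const float* base, std::size_t n) {
    const char* first = reinterpret_cast<const char*>(base);
    const char* last  = reinterpret_cast<const char*>(base + n);
    return last - first;
}
```

With float R[] = {1.0f, 1.2f, 3.3f, 3.4f};, element_at(R, 4, 1) reads the second element and byte_distance(R, 1) equals sizeof(float). On a platform where sizeof(float) is 4, element_at(R, 4, sizeof(float)) would trip the assertion, which mirrors the mistake in the question.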

When you declare float *d_B = d_M + numElements; the compiler treats d_M as pointing to floats laid out contiguously in memory, each the size of a float. Hence, you do not specify the distance in bytes but in elements, and the compiler does the math for you. This approach is more human-friendly, since it is more intuitive to express pointer arithmetic in elements than in bytes. It is also more portable: if the number of bytes of a given type changes with the underlying architecture, the compiler handles that for you, so your code will not break because you assumed a fixed byte size.
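As an illustration of carving one allocation into two arrays by element offset, here is a small host-side sketch. It assumes a plain host buffer standing in for the pointer returned by cudaMalloc (the pointer arithmetic is the same), and TwoArrays and split_buffer are hypothetical names that do not appear in the original code.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical view of one allocation carved into two float arrays.
struct TwoArrays {
    float* a;  // first numElements floats
    float* b;  // next numElements floats
};

// Split a buffer holding totalElements floats into two arrays of
// numElements each. The offset is expressed in elements, not bytes.
inline TwoArrays split_buffer(float* base, std::size_t totalElements,
                              std::size_t numElements) {
    assert(2 * numElements <= totalElements);  // both arrays must fit
    return TwoArrays{base, base + numElements};
}
```

In the question's terms, base would be d_M, totalElements would be 2 * numElements, and the two members would play the roles of d_A and d_B.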

You said that "The result was that the second cudaMemcpy returned an error with message invalid argument":

If you print the number corresponding to this error, it prints 11 (the numeric value depends on the CUDA version; newer toolkits report 1 for this error). The CUDA API documentation shows that it corresponds to:

cudaErrorInvalidValue

This indicates that one or more of the parameters passed to the API call is not within an acceptable range of values.

In your example, this means that float *d_B = d_M + size; points outside the allocated range.

You have allocated space for 100000 floats. d_A covers elements 0 to 49999, but according to your code d_B starts at element numElements * sizeof(float) = 50000 * 4 = 200000. Since 200000 > 100000, you get invalid argument.
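The arithmetic above can be written as a compile-time check. This is only a sketch: fits and capacity are hypothetical names introduced here, and the second assertion assumes sizeof(float) is 4, as in the answer.

```cpp
#include <cstddef>

// Hypothetical helper: true if `count` elements starting at element
// `offset` lie inside an allocation of `capacity` elements.
constexpr bool fits(std::size_t offset, std::size_t count, std::size_t capacity) {
    return offset <= capacity && count <= capacity - offset;
}

constexpr std::size_t numElements = 50000;
constexpr std::size_t capacity = 2 * numElements;          // floats allocated
constexpr std::size_t size = numElements * sizeof(float);  // bytes per array

// Correct: d_B starts numElements floats into the buffer.
static_assert(fits(numElements, numElements, capacity), "d_B must fit");

// The question's code: the byte count `size` used as an element offset.
// Assumes sizeof(float) == 4, so the offset is 200000 elements.
static_assert(!fits(size, numElements, capacity), "byte offset overshoots");
```

Both assertions pass at compile time under that assumption, which matches the invalid argument error: the copy into d_B would start well past the end of the 100000-float allocation.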
