I wrote a program that computes the integral of a simple function. While testing it, I found that if I use an array larger than 10 million elements, it produces the wrong answer. The error seems to occur once the array has been manipulated in the CUDA kernel; 10 million elements and below work fine and produce the correct result.
Is there a size limit on the number of elements that can be transferred to, or computed on, the GPU?
P.S. I'm using a C-style array containing floats.
There are many different kinds of memory that you can use with CUDA. In particular, you have

cuMemAlloc()
cuMemHostAlloc()
cuMemAllocHost()
cuMemAllocPitch()

Each kind of memory is associated with its own hardware resource limits, many of which you can find by using cuDeviceGetAttribute(). The function cuMemGetInfo() returns the amount of free and total memory on the device, but because of alignment requirements, allocating 1,000,000 floats may result in more than 1,000,000 * sizeof(float) bytes being consumed. The maximum number of blocks that you can schedule at once is also a limitation: if you exceed it, the kernel will fail to launch (you can easily find this number using cuDeviceGetAttribute()). You can find out the alignment requirements for different amounts of memory using the CUDA Driver API, but for a simple program you can make a reasonable guess and check the return value of the allocation function to determine whether the allocation succeeded.
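For example, here is a minimal sketch of that kind of checking with the Driver API (the device index 0 and the 100-million-float allocation size are arbitrary values chosen just for illustration):

```cuda
#include <cuda.h>
#include <stdio.h>

int main(void) {
    CUdevice dev;
    CUcontext ctx;

    /* Initialize the Driver API and create a context on device 0. */
    cuInit(0);
    cuDeviceGet(&dev, 0);
    cuCtxCreate(&ctx, 0, dev);

    /* Query a couple of the hardware limits mentioned above. */
    int maxGridX, maxThreadsPerBlock;
    cuDeviceGetAttribute(&maxGridX, CU_DEVICE_ATTRIBUTE_MAX_GRID_DIM_X, dev);
    cuDeviceGetAttribute(&maxThreadsPerBlock, CU_DEVICE_ATTRIBUTE_MAX_THREADS_PER_BLOCK, dev);
    printf("max grid dim x: %d, max threads per block: %d\n", maxGridX, maxThreadsPerBlock);

    /* Free and total device memory right now. */
    size_t freeBytes, totalBytes;
    cuMemGetInfo(&freeBytes, &totalBytes);
    printf("free: %zu bytes, total: %zu bytes\n", freeBytes, totalBytes);

    /* Try the allocation and check the return value rather than assuming success. */
    size_t n = 100 * 1000 * 1000;  /* 100 million floats, an arbitrary example size */
    CUdeviceptr d_arr;
    CUresult rc = cuMemAlloc(&d_arr, n * sizeof(float));
    if (rc != CUDA_SUCCESS) {
        fprintf(stderr, "cuMemAlloc failed with error %d\n", (int)rc);
        return 1;
    }

    cuMemFree(d_arr);
    cuCtxDestroy(ctx);
    return 0;
}
```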
There is no restriction on the number of bytes that you can transfer; using asynchronous functions, you can overlap kernel execution with memory copying (provided that your card supports this). Exceeding the maximum number of blocks you can schedule, or exhausting the available memory on your device, means that you will have to split up your task so that you can use multiple kernels to handle it.
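If you do end up splitting the work, the loop might look roughly like the sketch below (Runtime API for brevity; the kernel `process`, the 20-million-element total, and the 5-million-element chunk size are made up for illustration). Note that with a single stream the copy and the kernel for the same chunk still run one after another; real copy/compute overlap needs at least two streams and two device buffers:

```cuda
#include <cuda_runtime.h>

/* Placeholder kernel: doubles each element of a chunk. */
__global__ void process(float *data, size_t n) {
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        data[i] *= 2.0f;
}

int main(void) {
    const size_t n = 20 * 1000 * 1000;     /* 20 million floats in total (example size) */
    const size_t chunk = 5 * 1000 * 1000;  /* processed in 5-million-element chunks     */

    float *h_data;
    cudaMallocHost(&h_data, n * sizeof(float));  /* pinned host memory for async copies */
    for (size_t i = 0; i < n; ++i) h_data[i] = 1.0f;

    float *d_data;
    cudaMalloc(&d_data, chunk * sizeof(float));  /* only one chunk lives on the device */

    cudaStream_t stream;
    cudaStreamCreate(&stream);

    for (size_t off = 0; off < n; off += chunk) {
        size_t m = (n - off < chunk) ? (n - off) : chunk;
        int blocks = (int)((m + 255) / 256);
        cudaMemcpyAsync(d_data, h_data + off, m * sizeof(float),
                        cudaMemcpyHostToDevice, stream);
        process<<<blocks, 256, 0, stream>>>(d_data, m);
        cudaMemcpyAsync(h_data + off, d_data, m * sizeof(float),
                        cudaMemcpyDeviceToHost, stream);
    }
    cudaStreamSynchronize(stream);

    cudaStreamDestroy(stream);
    cudaFree(d_data);
    cudaFreeHost(h_data);
    return 0;
}
```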
For compute capability >= 3.0, the maximum grid dimensions are 2147483647 x 65535 x 65535, so a one-dimensional grid should cover any 1D array of up to 2147483647 x 1024 = 2.1990233e+12 elements.
A 1-billion-element array is definitely fine.
1,000,000,000 / 1024 = 976562.5, so round up to 976563 blocks. Just make sure that if threadIdx.x + blockIdx.x * blockDim.x >= the number of elements, you return from the kernel without doing any processing.
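A minimal sketch of that bounds check (the kernel `f` and the doubling it does are just placeholders):

```cuda
__global__ void f(float *data, size_t n) {
    /* Global 1D index of this thread. */
    size_t i = (size_t)blockIdx.x * blockDim.x + threadIdx.x;
    /* The last block is only partially used: threads whose index falls past
       the end of the array return without touching memory. */
    if (i >= n)
        return;
    data[i] *= 2.0f;  /* placeholder work */
}

/* Launch: 1,000,000,000 elements / 1024 threads per block -> 976563 blocks. */
/* f<<<976563, 1024>>>(d_data, 1000000000UL); */
```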