我已经在 NVIDIA 开发论坛上发布了我的问题,但还没有明确的答案。
我刚开始学习 CUDA,我真的很惊讶,与我在 Internet 上发现的相反,我的卡 (GeForce GTX 660M) 支持一些疯狂的网格尺寸 ( 2147483647 x 65535 x 65535)。请查看我从随工具包提供的 deviceQuery.exe 获得的以下结果:
c:\ProgramData\NVIDIA Corporation\CUDA Samples\v5.0\bin\win64\Release>deviceQuery.exe
deviceQuery.exe Starting...
CUDA Device Query (Runtime API) version (CUDART static linking)
Detected 1 CUDA Capable device(s)
Device 0: "GeForce GTX 660M"
CUDA Driver Version / Runtime Version 5.5 / 5.0
CUDA Capability Major/Minor version number: 3.0
Total amount of global memory: 2048 MBytes (2147287040 bytes)
( 2) Multiprocessors x (192) CUDA Cores/MP: 384 CUDA Cores
GPU Clock rate: 950 MHz (0.95 GHz)
Memory Clock rate: 2500 Mhz
Memory Bus Width: 128-bit
L2 Cache Size: 262144 bytes
Max Texture Dimension Size (x,y,z) 1D=(65536), 2D=(65536,65536), 3D=(4096,4096,4096)
Max Layered Texture Size (dim) x layers 1D=(16384) x 2048, 2D=(16384,16384) x 2048
Total amount of constant memory: 65536 bytes
Total amount of shared memory per block: 49152 bytes
Total number of registers available per block: 65536
Warp size: 32
Maximum number of threads per multiprocessor: 2048
Maximum number of threads per block: 1024
Maximum sizes of each dimension of a block: 1024 x 1024 x 64
Maximum sizes of each dimension of a grid: 2147483647 x 65535 x 65535
Maximum memory pitch: 2147483647 bytes
Texture alignment: 512 bytes
Concurrent copy and kernel execution: Yes with 1 copy engine(s)
Run time limit on kernels: Yes
Integrated GPU sharing Host Memory: No
Support host page-locked memory mapping: Yes
Alignment requirement for Surfaces: Yes
Device has ECC support: Disabled
CUDA Device Driver Mode (TCC or WDDM): WDDM (Windows Display Driver Model)
Device supports Unified Addressing (UVA): Yes
Device PCI Bus ID / PCI location ID: 1 / 0
Compute Mode:
< Default (multiple host threads can use ::cudaSetDevice() with device simultaneously) >
deviceQuery, CUDA Driver = CUDART, CUDA Driver Version = 5.5, CUDA Runtime Version = 5.0, NumDevs = 1, Device0 = GeForce GTX 660M
我很好奇编写了一个简单的程序来测试是否可以在网格的第一维中使用超过 65535 个块,但它无法确认我在 Internet 上找到的内容(或者,更准确地说,确实有效适用于 65535 块,不适用于 65536)。
我的程序非常简单,基本上只是添加了两个向量。这是源代码:
#include <cuda.h>
#include <cuda_runtime.h>
#include <device_launch_parameters.h>
#include <stdio.h>
#include <math.h>
#pragma comment(lib, "cudart")
typedef struct
{
float *content;
const unsigned int size;
} pjVector_t;
__global__ void AddVectorsKernel(float *firstVector, float *secondVector, float *resultVector)
{
unsigned int index = threadIdx.x + blockIdx.x * blockDim.x;
resultVector[index] = firstVector[index] + secondVector[index];
}
int main(void)
{
//const unsigned int vectorLength = 67107840; // 1024 * 65535 - works fine
const unsigned int vectorLength = 67108864; // 1024 * 65536 - doesn't work
const unsigned int vectorSize = sizeof(float) * vectorLength;
int threads = 0;
unsigned int blocks = 0;
cudaDeviceProp deviceProperties;
cudaError_t error;
pjVector_t firstVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
pjVector_t secondVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
pjVector_t resultVector = { (float *)calloc(vectorLength, sizeof(float)), vectorLength };
float *d_firstVector;
float *d_secondVector;
float *d_resultVector;
cudaMalloc((void **)&d_firstVector, vectorSize);
cudaMalloc((void **)&d_secondVector, vectorSize);
cudaMalloc((void **)&d_resultVector, vectorSize);
cudaGetDeviceProperties(&deviceProperties, 0);
threads = deviceProperties.maxThreadsPerBlock;
blocks = (unsigned int)ceil(vectorLength / (double)threads);
for (unsigned int i = 0; i < vectorLength; i++)
{
firstVector.content[i] = 1.0f;
secondVector.content[i] = 2.0f;
}
cudaMemcpy(d_firstVector, firstVector.content, vectorSize, cudaMemcpyHostToDevice);
cudaMemcpy(d_secondVector, secondVector.content, vectorSize, cudaMemcpyHostToDevice);
AddVectorsKernel<<<blocks, threads>>>(d_firstVector, d_secondVector, d_resultVector);
error = cudaPeekAtLastError();
cudaMemcpy(resultVector.content, d_resultVector, vectorSize, cudaMemcpyDeviceToHost);
for (unsigned int i = 0; i < vectorLength; i++)
{
if(resultVector.content[i] != 3.0f)
{
free(firstVector.content);
free(secondVector.content);
free(resultVector.content);
cudaFree(d_firstVector);
cudaFree(d_secondVector);
cudaFree(d_resultVector);
cudaDeviceReset();
printf("Error under index: %i\n", i);
return 0;
}
}
free(firstVector.content);
free(secondVector.content);
free(resultVector.content);
cudaFree(d_firstVector);
cudaFree(d_secondVector);
cudaFree(d_resultVector);
cudaDeviceReset();
printf("Everything ok!\n");
return 0;
}
当我在调试模式(更大的向量)下从 Visual Studio 运行它时,最后一个 cudaMemcpy 总是用看似随机的数据填充我的 resultVector(如果重要的话,非常接近 0),因此结果不会通过最终验证。当我尝试使用 Visual Profiler 对其进行分析时,它返回以下错误消息:
2 个事件、0 个指标和 0 个源级指标与内核无关,不会显示
因此,分析器仅测量cudaMalloc和cudaMemcpy操作,甚至不显示内核执行。
我不确定我检查 cuda erros 是否正确,所以请告诉我是否可以做得更好。cudaPeekAtLastError()放置在我的内核启动之后,当使用更大的向量时返回cudaErrorInvalidValue(11)错误,而对于所有其他调用(cudaMalloc和cudaMemcpy )返回cudaSuccess(0 ) 。当我使用较小的向量运行我的程序时,所有 cuda 函数和我的内核启动都不会返回错误(cudaSuccess(0))并且它工作得很好。
所以我的问题是:cudaGetDeviceProperties 返回垃圾网格大小值还是我做错了什么?