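Here is a simple test program I have been working on (to help debug my running-sum function), and I can't figure out what is wrong with it. The program just calls my running-sum function on a small list and tries to print the data. The line causing all the trouble is the one that is commented out: the cudaMemcpy (DeviceToHost). When that line is part of the code, the error I get is: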
CUDA error at: student_func.cu:136
unspecified launch failure cudaGetLastError()
terminate called after throwing an instance of 'thrust::system::system_error'
  what():  unload of CUDA runtime failed
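I have no idea what is wrong here, and it is driving me crazy. I tried allocating the host buffer with plain old malloc instead, with the same result. I have confirmed that the input data is copied to the device array correctly (by printing from inside the kernel), but I cannot copy the results from the device back to the host at all. I would really appreciate any help! Thanks in advance :)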
unsigned int numElems = 100;
unsigned int blockLength = min( (unsigned int) 1024, (unsigned int) numElems);
unsigned int gridLength = (numElems + blockLength - 1) / blockLength; // integer ceiling division
unsigned int* d_in;
unsigned int* h_in;
checkCudaErrors(cudaMallocHost(&h_in, sizeof(unsigned int) * numElems));
// fill the pinned host buffer with 0, 1, ..., numElems - 1
for (unsigned int i = 0; i < numElems; i++)
{
    h_in[i] = i;
}
checkCudaErrors(cudaMalloc(&d_in, sizeof(unsigned int) * numElems));
checkCudaErrors(cudaMemcpy(d_in, h_in, sizeof(unsigned int) * numElems, cudaMemcpyHostToDevice));
exclusive_running_sum<<< gridLength, blockLength >>>(d_in, d_in, numElems);
cudaDeviceSynchronize(); // wait for the kernel to finish
checkCudaErrors(cudaGetLastError());
//this line is a problem!!
//checkCudaErrors(cudaMemcpy(h_in, d_in, sizeof(unsigned int) * numElems, cudaMemcpyDeviceToHost));
for (unsigned int i = 0; i < numElems; i++)
{
    printf("%u %u\n", i, h_in[i]); // %u because both values are unsigned int
}
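For checking the output once the device-to-host copy works, a host-side reference can help. The following is only a minimal sketch, assuming that "exclusive running sum" means out[i] is the sum of in[0] through in[i-1]. The name exclusive_running_sum_host is hypothetical and not part of the program above. It is plain C++ host code, so it should also compile inside a .cu file.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical host-side reference, not part of the original program.
// Assumes "exclusive" means out[i] = in[0] + ... + in[i-1], so out[0] == 0.
std::vector<unsigned int> exclusive_running_sum_host(const std::vector<unsigned int>& in)
{
    std::vector<unsigned int> out(in.size());
    unsigned int running = 0;
    for (std::size_t i = 0; i < in.size(); ++i)
    {
        out[i] = running;   // sum of every element before index i
        running += in[i];   // fold in[i] in for the next position
    }
    return out;
}
```

With the test input above (h_in[i] = i), each expected value under this assumption is i * (i - 1) / 2, so the copied-back array can be compared element by element against the reference.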