I have a CUDA kernel that runs fine under the Nsight CUDA profiler or when launched directly from the terminal. But if I run it with this command:
cuda-memcheck --leak-check full ./CudaTT 1 ../../file.jpg
it crashes with "unspecified launch failure". I run this check after every kernel launch:
e=cudaDeviceSynchronize();
if (e != cudaSuccess) printf("Fail in kernel 2 %s",cudaGetErrorString(e));
and cuda-memcheck reports several of these:
========= Program hit error 4 on CUDA API call to cudaDeviceSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaDeviceSynchronize + 0x214) [0x27e24]
=========
========= Program hit error 4 on CUDA API call to cudaFree
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaFree + 0x228) [0x338b8]
At the end it shows:
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 10 errors
Any idea why this happens?
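A minimal sketch of a stricter version of the check above, offered as an illustration only. The kernel `dummyKernel` and the helpers `checkCuda` and `runChecked` are hypothetical names that do not come from the original program. The sketch assumes it is useful to call `cudaGetLastError()` right after the launch, so that a kernel that never starts (for example, because it requests too many resources) is reported separately from one that fails while executing.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical stand-in kernel; the real kernels are not shown in this post.
__global__ void dummyKernel(int *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;  // bounds check avoids out-of-range writes
}

// Hypothetical helper: print a labelled message on error, return true on success.
static bool checkCuda(cudaError_t e, const char *what) {
    if (e != cudaSuccess) {
        fprintf(stderr, "Fail in %s: %s\n", what, cudaGetErrorString(e));
        return false;
    }
    return true;
}

// Launch the kernel and return the first error seen, if any.
cudaError_t runChecked(int n) {
    int *d_out = NULL;
    cudaError_t e = cudaMalloc((void **)&d_out, n * sizeof(int));
    if (e != cudaSuccess) return e;

    int threads = 256;
    int blocks = (n + threads - 1) / threads;
    dummyKernel<<<blocks, threads>>>(d_out, n);

    e = cudaGetLastError();          // errors raised by the launch itself
    if (e == cudaSuccess)
        e = cudaDeviceSynchronize(); // errors raised while the kernel ran

    cudaFree(d_out);
    return e;
}

int main() {
    cudaError_t e = runChecked(1 << 20);
    return checkCuda(e, "dummyKernel") ? 0 : 1;
}
```

Once one kernel has failed during execution, later API calls such as `cudaFree` can return the same error code, which would be consistent with the same error number showing up on both `cudaDeviceSynchronize` and `cudaFree` in the log above.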
Edit:
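I commented out another kernel that was failing to launch because it uses too many registers, and the error on the kernel above has now changed to "the launch timed out and was terminated". Again, it runs fine under the CUDA profiler and from the terminal without cuda-memcheck, but with cuda-memcheck it shows this: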
========= Program hit error 6 on CUDA API call to cudaDeviceSynchronize
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaDeviceSynchronize + 0x214) [0x27e24]
=========
========= Program hit error 6 on CUDA API call to cudaFree
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/usr/lib/libcuda.so [0x24e129]
========= Host Frame:/usr/local/cuda-5.0/lib/libcudart.so.5.0 (cudaFree + 0x228) [0x338b8]
========= Host Frame:[0xbf913ea8]
and at the end the same 10 errors:
========= LEAK SUMMARY: 0 bytes leaked in 0 allocations
========= ERROR SUMMARY: 10 errors
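Error 6 appears to mean that the kernel ran too long and hit the launch timeout, but then how does it work without cuda-memcheck? In the profiler, the kernel takes 3.771 seconds.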
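A small sketch for checking whether a kernel run-time limit applies to the GPU at all. It assumes device 0 is the GPU in use, and `watchdogEnabled` is a hypothetical helper name. It reads the `kernelExecTimeoutEnabled` field of `cudaDeviceProp`, which reports whether the device enforces a run-time limit on kernels. It does not explain by itself why the limit would only be hit under cuda-memcheck, although kernels generally run slower under that tool.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical helper: returns 1 if a kernel run-time limit is enabled on
// device `dev`, 0 if not, and -1 if the properties could not be queried.
int watchdogEnabled(int dev) {
    cudaDeviceProp prop;
    cudaError_t e = cudaGetDeviceProperties(&prop, dev);
    if (e != cudaSuccess) {
        fprintf(stderr, "cudaGetDeviceProperties failed: %s\n",
                cudaGetErrorString(e));
        return -1;
    }
    printf("%s: kernelExecTimeoutEnabled = %d\n",
           prop.name, prop.kernelExecTimeoutEnabled);
    return prop.kernelExecTimeoutEnabled ? 1 : 0;
}

int main() {
    // Assumption: device 0 is the GPU the program runs on.
    return watchdogEnabled(0) < 0 ? 1 : 0;
}
```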
Another strange behavior: I print some values after the computation, and they differ depending on whether I run under cuda-memcheck or not.