I think you have simply been tripped up by compiler optimization in this example, as was already suggested.
Here is my test case:
$ cat t1130.cu
#include <stdio.h>
__global__ void addKernel(int *c)
{
    int x[1];                 // 1-element local array
    for (int i(0); i < 100; ++i) {
        x[i] = i;             // intentional out-of-bounds writes for i > 0
#ifdef FORCE
        printf("%d ", i);     // forces the compiler to keep the loop
#endif
    }
    int t(0);
    for (int i(0); i < 100; ++i) {
        t += x[i];
    }
    c[0] = t;
}
int main(){
    int *d_c;
    cudaMalloc(&d_c, sizeof(int));
    addKernel<<<1,1>>>(d_c);
    cudaDeviceSynchronize();
}
$ nvcc -o t1130 t1130.cu
$ cuda-memcheck ./t1130
========= CUDA-MEMCHECK
========= ERROR SUMMARY: 0 errors
$ nvcc -DFORCE -o t1130 t1130.cu
$ cuda-memcheck ./t1130
========= CUDA-MEMCHECK
========= Invalid __local__ write of size 4
========= at 0x00000168 in addKernel(int*)
========= by thread (0,0,0) in block (0,0,0)
========= Address 0x00fffd10 is out of bounds
========= Saved host backtrace up to driver entry point at kernel launch time
========= Host Frame:/lib64/libcuda.so.1 (cuLaunchKernel + 0x2cd) [0x15865d]
========= Host Frame:./t1130 [0x16ca1]
========= Host Frame:./t1130 [0x314b3]
========= Host Frame:./t1130 [0x27a1]
========= Host Frame:./t1130 [0x269c]
========= Host Frame:./t1130 [0x26b6]
========= Host Frame:./t1130 [0x2600]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21d65]
========= Host Frame:./t1130 [0x2489]
=========
0 1 2 3 ========= Program hit cudaErrorLaunchFailure (error 4) due to "unspecified launch failure" on CUDA API call to cudaDeviceSynchronize.
========= Saved host backtrace up to driver entry point at error
========= Host Frame:/lib64/libcuda.so.1 [0x2f31b3]
========= Host Frame:./t1130 [0x354a6]
========= Host Frame:./t1130 [0x2605]
========= Host Frame:/lib64/libc.so.6 (__libc_start_main + 0xf5) [0x21d65]
========= Host Frame:./t1130 [0x2489]
=========
========= ERROR SUMMARY: 2 errors
$
So if we compile your code as-is, it runs cleanly under cuda-memcheck, with no errors reported. But if we force the compiler to generate the actual loops, the invalid access is detected and reported.
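As an aside, a similar effect can probably be had without the printf. The following is a minimal sketch of my own (not the method used above, and the kernel name addKernelVolatile is just illustrative), under the assumption that the volatile qualifier is enough to stop the compiler from eliding the dead stores, so the out-of-bounds accesses survive into the generated code:
// Sketch only: volatile obliges the compiler to emit every access to x,
// so the out-of-bounds stores should remain visible to cuda-memcheck.
__global__ void addKernelVolatile(int *c)   // hypothetical variant of addKernel
{
    volatile int x[1];                      // still 1 element, accessed 100 times
    int t(0);
    for (int i(0); i < 100; ++i) {
        x[i] = i;                           // intentional OOB writes, not optimized away
        t += x[i];
    }
    c[0] = t;
}
int main(){
    int *d_c;
    cudaMalloc(&d_c, sizeof(int));
    addKernelVolatile<<<1,1>>>(d_c);
    cudaDeviceSynchronize();
}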
This also explains your comment: if we use -G instead of -DFORCE in the case above, cuda-memcheck will likewise detect and report the error, because device-code optimization is disabled.
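For reference, the -G case from that comment would be built and run like this (commands only; I have omitted the output here, but it should contain a similar invalid __local__ write report):
$ nvcc -G -o t1130 t1130.cu
$ cuda-memcheck ./t1130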