不幸的是,您认为代码在为 sm_35 编译并在 sm_30 GPU 上运行时有效的结论是不正确的。罪魁祸首是这样的:
void cbow_cuda(long window, long negative, float alpha, long sentence_length,
int *sen, long layer1_size, float *syn0, long hs, float *syn1,
float *expTable, int *vocab_codelen, char *vocab_code,
int *vocab_point, int *table, long table_size,
long vocab_size, float *syn1neg){
int blockSize = 256;
int gridSize = (sentence_length)/(blockSize/32);
size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
//printf("sm size is %d\n", smsize);
//fflush(stdout);
cbow_kernel<1><<<gridSize, blockSize, smsize>>>
(window, negative, alpha, sentence_length, sen,
layer1_size, syn0, syn1, expTable, vocab_codelen,
vocab_code, vocab_point, table, table_size,
vocab_size, syn1neg);
}
如果内核启动由于不完整的 API 错误检查而失败,此代码将静默失败。如果您为 sm_35 构建并在 sm_30 上运行,内核启动确实会失败。如果将该函数的代码更改为此(添加内核启动错误检查):
void cbow_cuda(long window, long negative, float alpha, long sentence_length,
int *sen, long layer1_size, float *syn0, long hs, float *syn1,
float *expTable, int *vocab_codelen, char *vocab_code,
int *vocab_point, int *table, long table_size,
long vocab_size, float *syn1neg){
int blockSize = 256;
int gridSize = (sentence_length)/(blockSize/32);
size_t smsize = (blockSize/32)*(2*layer1_size+3)*sizeof(float);
//printf("sm size is %d\n", smsize);
//fflush(stdout);
cbow_kernel<1><<<gridSize, blockSize, smsize>>>
(window, negative, alpha, sentence_length, sen,
layer1_size, syn0, syn1, expTable, vocab_codelen,
vocab_code, vocab_point, table, table_size,
vocab_size, syn1neg);
checkCUDAError( cudaPeekAtLastError() );
}
并为 sm_35 编译并运行它,你应该在 sm_30 设备上得到它:
~/cbow/word2vec_cbow$ make
nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w -Xptxas="-v" -arch=sm_35 -lineinfo
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_35'
ptxas info : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 448 bytes cmem[0], 8 bytes cmem[2]
~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
vocab size = 71290
cbow.cu 114 : invalid device function
IE。内核启动失败,因为在您的应用程序的 CUDA cubin 有效负载中未找到适当的设备代码。这也回答了您之前关于为什么此代码的输出不正确的问题。当使用默认选项构建时,分析内核根本不会在您的硬件上运行。
如果我为 sm_30 构建此代码并在具有 2gb 内存(计算能力 3.0)的 GTX 670 上运行它,我会得到:
~/cbow/word2vec_cbow$ make
nvcc word2vec.cu -o word2vec -O3 -Xcompiler -march=native -w -Xptxas="-v" -arch=sm_30 -lineinfo
ptxas info : 0 bytes gmem
ptxas info : Compiling entry function '_Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_' for 'sm_30'
ptxas info : Function properties for _Z11cbow_kernelILx1EEvllflPKilPVfS3_PKfS1_PKcS1_S1_llS3_
0 bytes stack frame, 0 bytes spill stores, 0 bytes spill loads
ptxas info : Used 34 registers, 448 bytes cmem[0], 12 bytes cmem[2]
~/cbow/word2vec_cbow$ ./word2vec -train text8 -output vectors.bin -cbow 1 -size 200 -window 7 -negative 1 -hs 1 -sample 1e-3 -threads 1 -binary 1 -save-vocab voc #> out 2>&1
Starting training using file text8
Vocab size: 71290
Words in train file: 16718843
vocab size = 71290
Alpha: 0.000009 Progress: 100.00% Words/thread/sec: 1217.45k
IE。代码正确运行完成,没有任何错误。我无法告诉您为什么您无法让代码在您的硬件上运行,因为我无法在我的硬件上重现您的错误。您将需要自己进行一些调试以找到其根本原因。