
I'm trying out the new PGI Community Edition (17.4) with a toy example (see below), and I get an error in the call to acc_init.

The code to reproduce the error is:

#include <openacc.h>
#include <cuda_runtime_api.h>
#include <stdio.h>

int main()
{
   acc_init( acc_device_nvidia );

   int ndev = acc_get_num_devices( acc_device_nvidia );

   printf("Num OpenACC devices: %d\n", ndev);

   cudaGetDeviceCount(&ndev);

   printf("Num CUDA devices: %d\n", ndev);

   return 0;
}

Compilation: /usr/local/pgi/linux86-64/17.4/bin/pgcc -acc -ta=tesla -Mcuda ./test.c -o oacc_test.pgi

cuda-memcheck output:

$ cuda-memcheck ./oacc_test.pgi 
========= CUDA-MEMCHECK
========= Program hit CUDA_ERROR_INVALID_DEVICE (error 101) due to "invalid device ordinal" on CUDA API call to cuDevicePrimaryCtxRetain. 
=========     Saved host backtrace up to driver entry point at error
=========     Host Frame:/usr/lib/x86_64-linux-gnu/libcuda.so (cuDevicePrimaryCtxRetain + 0x15c) [0x1e8d1c]
=========     Host Frame:/usr/local/pgi/linux86-64/17.4/lib/libaccnc.so (__pgi_uacc_cuda_initdev + 0x80b) [0x6f0b]
=========     Host Frame:/usr/local/pgi/linux86-64/17.4/lib/libaccg.so (__pgi_uacc_enumerate + 0x148) [0x11388]
=========     Host Frame:/usr/local/pgi/linux86-64/17.4/lib/libaccg.so (__pgi_uacc_initialize + 0x5b) [0x117ab]
=========     Host Frame:/usr/local/pgi/linux86-64/17.4/lib/libaccapi.so (acc_init + 0x22) [0xe4f2]
=========     Host Frame:./oacc_test.pgi [0xbc4]
=========     Host Frame:/lib/x86_64-linux-gnu/libc.so.6 (__libc_start_main + 0xf1) [0x202b1]
=========     Host Frame:./oacc_test.pgi [0xaca]
=========
Num OpenACC devices: 1
Num CUDA devices: 1
========= ERROR SUMMARY: 1 error

Apparently __pgi_uacc_cuda_initdev passes "-1" as the second argument (CUdevice dev) to cuDevicePrimaryCtxRetain (a bug?):

Breakpoint 1, 0x00007ffff4ab0bc0 in cuDevicePrimaryCtxRetain () from /usr/lib/x86_64-linux-gnu/libcuda.so
(cuda-gdb) p /x $rsi
$7 = 0xffffffff
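
For reference, here is a minimal sketch of why the register value 0xffffffff reads as "-1". CUdevice is declared as a plain int in cuda.h, so the low 32 bits of $rsi are interpreted as a signed value. The names device_ordinal_t and ordinal_from_register are hypothetical helpers introduced only for this illustration, and a plain int stands in for CUdevice so the snippet builds without the CUDA headers.

```c
#include <stdint.h>
#include <string.h>

/* Hypothetical stand-in for CUdevice, which cuda.h declares as an int. */
typedef int device_ordinal_t;

/* Interpret the low 32 bits of a 64-bit register value as a signed
   device ordinal, the way an int argument is read from a register. */
device_ordinal_t ordinal_from_register(uint64_t reg)
{
    uint32_t low = (uint32_t)(reg & 0xffffffffu);
    int32_t as_signed;

    /* Copy the bit pattern instead of casting, so the result does not
       rely on implementation-defined integer conversion. */
    memcpy(&as_signed, &low, sizeof as_signed);
    return (device_ordinal_t)as_signed;
}
```

Calling ordinal_from_register(0xffffffff) gives -1, which matches the "invalid device ordinal" message that cuda-memcheck reports.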

I assume this is not normal. Is this a bug in 17.4, or is my installation broken?


1 Answer


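This is a normal and benign error. Basically, the PGI runtime is querying whether a CUDA context has already been created. However, since there is no CUDA runtime call to query for the existence of a context, we call "cuDevicePrimaryCtxRetain". If it returns an error, we know that we need to create a new context.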
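
The probe-then-create pattern described above can be sketched roughly as follows. This is only an illustration of the control flow, not the actual PGI runtime code: fake_runtime, fake_probe, fake_create and ensure_context are all hypothetical names, and no real CUDA or PGI symbols are used. The value 101 simply mirrors the error code shown in the cuda-memcheck log.

```c
/* Hypothetical status codes; 101 mirrors the code in the log above. */
enum { PROBE_OK = 0, PROBE_INVALID_DEVICE = 101 };

/* Hypothetical stand-in for the runtime's state. */
typedef struct {
    int has_context;   /* 1 once a context exists */
    int probe_errors;  /* how many probe calls reported an error */
    int creations;     /* how many contexts were created */
} fake_runtime;

/* Hypothetical probe: reports an error when no context exists yet. */
int fake_probe(fake_runtime *rt)
{
    if (!rt->has_context) {
        rt->probe_errors++;
        return PROBE_INVALID_DEVICE;
    }
    return PROBE_OK;
}

/* Hypothetical creation step. */
void fake_create(fake_runtime *rt)
{
    rt->has_context = 1;
    rt->creations++;
}

/* Probe first; a probe error means "nothing there yet", not a failure.
   Returns 1 if a new context was created, 0 if one already existed. */
int ensure_context(fake_runtime *rt)
{
    if (fake_probe(rt) != PROBE_OK) {
        fake_create(rt);
        return 1;
    }
    return 0;
}
```

In this sketch the first call to ensure_context records one probe error and then creates a context, and later calls find the existing context without any error. A tool that logs every failing API call would still report that first probe error, even though the caller handles it, which is why the message can be benign.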

Note that in PGI release 17.7 we changed this call slightly, so you will no longer see the error when running cuda-memcheck.

Answered 2017-08-29T19:03:13.533