最近有同事需要使用NVML查询设备信息,于是下载了Tesla开发包3.304.5,将文件nvml.h复制到/usr/include。为了测试,我在 tdk_3.304.5/nvml/example 中编译了示例代码,它运行良好。
一个周末,系统发生了一些变化(我无法确定发生了什么变化,而且我不是唯一可以访问机器的人),现在任何使用 nvml.h 的代码(例如示例代码)都失败并出现以下错误:
Failed to initialize NVML:
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
WARNING:
You should always run with libnvidia-ml.so that is installed with your NVIDIA Display Driver. By default it's installed in /usr/lib and /usr/lib64. libnvidia-ml.so in TDK package is a stub library that is attached only for build purposes (e.g. machine that you build your application doesn't have to have Display Driver installed).
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
但是,我仍然可以运行 nvidia-smi 并读取有关我的 K20m 状态的信息,据我所知,nvidia-smi 只是对 nvml.h 的一组调用。我收到的错误消息有点神秘,但我相信它告诉我 nvidia-ml.so 文件需要与我在系统上安装的 Tesla 驱动程序相匹配。为了确保一切正确,我重新下载了 CUDA 5.0 并安装了驱动程序、CUDA 运行时和测试文件。我确信 nvidia-ml.so 文件与驱动程序匹配(两者都是 304.54),所以我很困惑可能出了什么问题。我可以使用 nvcc 编译和运行测试代码,也可以运行我自己的 CUDA 代码,只要它不包含 nvml.h。
有没有人遇到过这个错误或者有任何关于纠正这个问题的想法?
$ ls -la /usr/lib/libnvidia-ml*
lrwxrwxrwx. 1 root root 17 Jul 19 10:08 /usr/lib/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root 22 Jul 19 10:08 /usr/lib/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 391872 Jul 19 10:08 /usr/lib/libnvidia-ml.so.304.54
$ ls -la /usr/lib64/libnvidia-ml*
lrwxrwxrwx. 1 root root 17 Jul 19 10:08 /usr/lib64/libnvidia-ml.so -> libnvidia-ml.so.1
lrwxrwxrwx. 1 root root 22 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.1 -> libnvidia-ml.so.304.54
-rwxr-xr-x. 1 root root 394792 Jul 19 10:08 /usr/lib64/libnvidia-ml.so.304.54
$ cat /proc/driver/nvidia/version
NVRM version: NVIDIA UNIX x86_64 Kernel Module 304.54 Sat Sep 29 00:05:49 PDT 2012
GCC version: gcc version 4.4.7 20120313 (Red Hat 4.4.7-3) (GCC)
$ nvcc -V
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2012 NVIDIA Corporation
Built on Fri_Sep_21_17:28:58_PDT_2012
Cuda compilation tools, release 5.0, V0.2.1221
$ whereis nvml.h
nvml: /usr/include/nvml.h
$ ldd example
linux-vdso.so.1 => (0x00007fff2da66000)
libnvidia-ml.so.1 => /usr/lib64/libnvidia-ml.so.1 (0x00007f33ff6db000)
libc.so.6 => /lib64/libc.so.6 (0x000000300e400000)
libpthread.so.0 => /lib64/libpthread.so.0 (0x000000300ec00000)
libdl.so.2 => /lib64/libdl.so.2 (0x000000300e800000)
/lib64/ld-linux-x86-64.so.2 (0x000000300e000000)
编辑:解决方案是删除所有额外的 libnvidia-ml.so 实例。出于某种原因,他们有很多。
$ sudo find / -name 'libnvidia-ml*'
/usr/lib/libnvidia-ml.so.304.54
/usr/lib/libnvidia-ml.so
/usr/lib/libnvidia-ml.so.1
/usr/opt/lib/libnvidia-ml.so
/usr/opt/lib/libnvidia-ml.so.1
/usr/opt/lib64/libnvidia-ml.so
/usr/opt/lib64/libnvidia-ml.so.1
/usr/opt/nvml/lib/libnvidia-ml.so
/usr/opt/nvml/lib/libnvidia-ml.so.1
/usr/opt/nvml/lib64/libnvidia-ml.so
/usr/opt/nvml/lib64/libnvidia-ml.so.1
/usr/lib64/libnvidia-ml.so.304.54
/usr/lib64/libnvidia-ml.so
/usr/lib64/libnvidia-ml.so.1
/lib/libnvidia-ml.so.old
/lib/libnvidia-ml.so.1