I finally managed to get CUDA working on a Microsoft Azure server with a Tesla K80. Now I need to get cuDNN working, but TensorFlow won't load it.
This is the output from TensorFlow:
>>> import tensorflow as tf
>>> tf.Session()
2017-04-27 13:05:51.476251: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:51.476306: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:51.476338: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:51.476366: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:51.476394: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-04-27 13:05:58.164781: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID ad52:00:00.0
Total memory: 11.17GiB
Free memory: 11.11GiB
2017-04-27 13:05:58.164822: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2017-04-27 13:05:58.164835: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2017-04-27 13:05:58.164853: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: ad52:00:00.0)
<tensorflow.python.client.session.Session object at 0x7fc3c76c0050>
So I don't see the cuDNN library being loaded.
I have the correct files in /cuda-8.0/include/ and /cuda-8.0/lib64/:
$ ls /usr/local/cuda-8.0/include/ | grep "cudnn"
cudnn.h
$ ls /usr/local/cuda-8.0/lib64/ | grep "cudnn"
libcudnn.so
libcudnn.so.5
libcudnn.so.5.1.10
libcudnn_static.a
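Besides listing the files, it can help to confirm that the header matches the library version (5.1.10 here). The following is a minimal sketch, assuming the cuDNN 5 layout where cudnn.h defines the macros CUDNN_MAJOR, CUDNN_MINOR and CUDNN_PATCHLEVEL; the function name read_cudnn_header_version is made up for illustration.

```python
import os
import re


def read_cudnn_header_version(header_path):
    """Return (major, minor, patch) parsed from cudnn.h, or None if a macro is missing."""
    with open(header_path) as f:
        text = f.read()
    parts = []
    for macro in ("CUDNN_MAJOR", "CUDNN_MINOR", "CUDNN_PATCHLEVEL"):
        # Match lines such as "#define CUDNN_MAJOR      5"
        m = re.search(r"^\s*#\s*define\s+" + macro + r"\s+(\d+)", text, re.MULTILINE)
        if m is None:
            return None
        parts.append(int(m.group(1)))
    return tuple(parts)


if __name__ == "__main__":
    # Path taken from the ls output above.
    path = "/usr/local/cuda-8.0/include/cudnn.h"
    if os.path.exists(path):
        print(read_cudnn_header_version(path))
    else:
        print("header not found:", path)
```

If the tuple printed here differs from the version in the libcudnn.so.* file name, the header and the shared library probably come from different cuDNN downloads.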
My ~/.bashrc file has the correct paths:
export CUDA_HOME=/usr/local/cuda8.0
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
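A quick way to sanity-check exports like these is to verify that every directory on LD_LIBRARY_PATH actually exists and to see which ones contain a cuDNN library. This is only a sketch: check_library_path is a hypothetical helper, and it inspects the path string alone, not what the dynamic loader finally resolves.

```python
import glob
import os


def check_library_path(value, pattern="libcudnn.so*"):
    """Map each directory in a colon-separated path string to its status.

    The status is None if the directory does not exist, otherwise a sorted
    list of file names in it that match `pattern` (possibly empty).
    """
    report = {}
    for entry in value.split(":"):
        if not entry:
            continue  # skip empty components such as a trailing ':'
        if not os.path.isdir(entry):
            report[entry] = None
        else:
            matches = glob.glob(os.path.join(entry, pattern))
            report[entry] = sorted(os.path.basename(m) for m in matches)
    return report


if __name__ == "__main__":
    current = os.environ.get("LD_LIBRARY_PATH", "")
    for directory, found in check_library_path(current).items():
        print(directory, "->", "MISSING DIRECTORY" if found is None else found)
```

A directory reported as missing usually points to a typo in the exported path, for example a directory name that does not match what is installed under /usr/local.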
Edit
I changed .bashrc to:
export CUDA_HOME=/usr/local/cuda-8.0
export PATH=${CUDA_HOME}/bin:${PATH}
export LD_LIBRARY_PATH=${CUDA_HOME}/lib64:/usr/local/cuda/lib64:${LD_LIBRARY_PATH}
export PATH=${CUDA_HOME}/include:${PATH}
Still no luck.
Output from nvidia-smi:
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 375.51                 Driver Version: 375.51                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  Tesla K80           Off  | AD52:00:00.0     Off |                    0 |
| N/A   71C    P0    61W / 149W |      0MiB / 11439MiB |     24%      Default |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                       GPU Memory |
|  GPU       PID  Type  Process name                               Usage      |
|=============================================================================|
|  No running processes found                                                 |
+-----------------------------------------------------------------------------+
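I am using TensorFlow 1.1.0, Ubuntu 16.04, and CUDA 8.0.
Edit
So I just tried removing the cuDNN files and loading TensorFlow, which gave me an error: something could not find libcudnn.so.5. So I think it does load it, but I was under the impression that TensorFlow would print something along the lines of "libcudnn.so loaded successfully" if cuDNN is being used.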
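Since no log line confirms it, one way to check directly is to ask the library for its version through ctypes and to look at which shared objects are mapped into the Python process. This is a rough sketch under a few assumptions: the library name libcudnn.so.5 comes from the listing above, the version encoding major*1000 + minor*100 + patch is the one used by cuDNN 5, /proc/self/maps is Linux-only, and the helper names are invented for illustration.

```python
import ctypes


def decode_cudnn_version(code):
    """Split an integer like 5110 into (major, minor, patch).

    Assumes the major*1000 + minor*100 + patch encoding used by cuDNN 5.
    """
    return code // 1000, (code % 1000) // 100, code % 100


def cudnn_runtime_version(libname="libcudnn.so.5"):
    """Return the decoded cuDNN version, or None if the library cannot be loaded."""
    try:
        lib = ctypes.CDLL(libname)
    except OSError:
        return None
    # cudnnGetVersion() takes no arguments and returns a size_t.
    lib.cudnnGetVersion.restype = ctypes.c_size_t
    return decode_cudnn_version(lib.cudnnGetVersion())


def mapped_libraries(substring, maps_path="/proc/self/maps"):
    """Return sorted paths from a /proc/<pid>/maps file that contain `substring`."""
    paths = set()
    with open(maps_path) as f:
        for line in f:
            # Fields: address, perms, offset, dev, inode, pathname (optional)
            fields = line.split(None, 5)
            if len(fields) == 6 and substring in fields[5]:
                paths.add(fields[5].strip())
    return sorted(paths)
```

Calling mapped_libraries("libcudnn") in the same interpreter after `import tensorflow as tf` and `tf.Session()` should list the libcudnn file if the process has it mapped; an empty list would suggest it has not been loaded at that point.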