My machine has 2 GPUs, a GTX 1070 and an RTX 3080.
I have a conda environment with TensorFlow 1.15 and all of its dependencies (CUDA 10, cuDNN 7.6, etc.). When I run my TensorFlow-based training script, I get:
#Training on GTX 1070
$ CUDA_VISIBLE_DEVICES=1, python train_script.py
#Output
2021-06-24 21:36:24.253225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
total_loss: 0.010825163 #Trains as usual
However, when I try to train on the RTX 3080:
#Training on RTX 3080
$ CUDA_VISIBLE_DEVICES=0, python train_script.py
#Output
2021-06-24 21:43:25.828707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-06-24 21:44:15.331037: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
...
File "/home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
(0) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
[[node ProjectNet/fc0/MatMul (defined at /home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
[[Sum_7/_421]]
(1) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
[[node ProjectNet/fc0/MatMul (defined at /home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
Graphics card information:
$ nvidia-smi
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 465.27       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:04:00.0 Off |                  N/A |
|  0%   42C    P8     5W / 151W |     11MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:2B:00.0  On |                  N/A |
|  0%   49C    P8    36W / 370W |    624MiB / 10001MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
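A note on the indices: nvidia-smi lists GPUs in PCI bus order, whereas CUDA by default enumerates them fastest-first, which would explain why CUDA_VISIBLE_DEVICES=0 selects the 3080 even though the table lists the 370W card as GPU 1. The following is a minimal sketch of how the two numberings could be made to agree when launching the script. `build_env` and `run_training` are hypothetical helper names made up for illustration (they are not part of TensorFlow or CUDA), and the sketch assumes `train_script.py` is in the current directory.

```python
import os
import subprocess
import sys


def build_env(gpu_index, base_env=None):
    """Return a copy of the environment exposing one GPU, indexed in PCI bus order.

    Hypothetical helper for illustration only.
    """
    env = dict(os.environ if base_env is None else base_env)
    # PCI_BUS_ID makes CUDA number the GPUs in PCI bus order, as nvidia-smi does;
    # the CUDA default (FASTEST_FIRST) can order them differently.
    env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    env["CUDA_VISIBLE_DEVICES"] = str(gpu_index)
    return env


def run_training(gpu_index, script="train_script.py"):
    """Launch the training script on the chosen GPU and return its exit code."""
    return subprocess.call([sys.executable, script], env=build_env(gpu_index))
```

With this, `run_training(1)` would start the script on the card that the table lists as GPU 1, assuming the driver reports the same bus order to CUDA.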
Can anyone explain why training fails on the RTX 3080?