My machine has 2 GPUs, a GTX 1070 and an RTX 3080.

I have a conda environment with TensorFlow 1.15 and all of its dependencies (CUDA 10, cuDNN 7.6, etc.). When I run my TensorFlow-based training script, I get:

#Training on GTX 1070
$ CUDA_VISIBLE_DEVICES=1, python train_script.py

#Output
2021-06-24 21:36:24.253225: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
total_loss: 0.010825163 #Trains as usual

However, when I try to train on the RTX 3080:

#Training on RTX 3080
$ CUDA_VISIBLE_DEVICES=0, python train_script.py

#Output
2021-06-24 21:43:25.828707: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcublas.so.10.0
2021-06-24 21:44:15.331037: E tensorflow/stream_executor/cuda/cuda_blas.cc:428] failed to run cuBLAS routine: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
...
File "/home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: 2 root error(s) found.
  (0) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
     [[node ProjectNet/fc0/MatMul (defined at /home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
     [[Sum_7/_421]]
  (1) Internal: Blas GEMM launch failed : a.shape=(1000, 2), b.shape=(2, 512), m=1000, n=512, k=2
     [[node ProjectNet/fc0/MatMul (defined at /home/Me/anaconda3/envs/ProjectNet/lib/python3.7/site-packages/tensorflow_core/python/framework/ops.py:1748) ]]
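
One thing that may be worth checking for a failure like this is whether the CUDA toolkit in the environment can target each card at all. The sketch below is only an illustration, not a confirmed diagnosis. It assumes that the 1070 reports compute capability 6.1 and the 3080 reports 8.6, and that CUDA 10.0 targets at most 7.5. The helper names `parse_compute_capability` and `is_supported`, the `MAX_CC_BY_CUDA` table, and the sample description strings are hypothetical and not part of TensorFlow.

```python
import re

# Assumption: highest compute capability each CUDA toolkit release can target.
# Check NVIDIA's release notes before relying on these values.
MAX_CC_BY_CUDA = {
    "10.0": (7, 5),  # up to Turing
    "11.0": (8, 0),  # adds Ampere data-center parts
    "11.1": (8, 6),  # adds Ampere GeForce RTX 30 series
}


def parse_compute_capability(device_desc):
    """Extract (major, minor) from a device description string.

    The expected format mirrors TensorFlow 1.x `physical_device_desc`
    strings, which end in "compute capability: X.Y".
    """
    match = re.search(r"compute capability:\s*(\d+)\.(\d+)", device_desc)
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


def is_supported(device_desc, cuda_version):
    """Return True if the device's compute capability is within the toolkit's range."""
    cc = parse_compute_capability(device_desc)
    if cc is None:
        raise ValueError("no compute capability found in: %r" % device_desc)
    return cc <= MAX_CC_BY_CUDA[cuda_version]


if __name__ == "__main__":
    # Hypothetical description strings for the two cards in this machine.
    samples = [
        "device: 0, name: GPU-A, pci bus id: 0000:2b:00.0, compute capability: 8.6",
        "device: 1, name: GPU-B, pci bus id: 0000:04:00.0, compute capability: 6.1",
    ]
    for desc in samples:
        print(desc, "-> supported by CUDA 10.0:", is_supported(desc, "10.0"))
```

In TensorFlow 1.15 the real description strings can be read from `tensorflow.python.client.device_lib.list_local_devices()`, where each GPU entry has a `physical_device_desc` attribute. If the 3080 falls outside the range of the installed toolkit, that would be consistent with the cuBLAS error above, although other causes (for example running out of GPU memory) can produce the same message.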

GPU information:

$ nvidia-smi

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 465.27       Driver Version: 465.27       CUDA Version: 11.3     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  On   | 00000000:04:00.0 Off |                  N/A |
|  0%   42C    P8     5W / 151W |     11MiB /  8119MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
|   1  NVIDIA GeForce ...  On   | 00000000:2B:00.0  On |                  N/A |
|  0%   49C    P8    36W / 370W |    624MiB / 10001MiB |     27%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+
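
Note that the indices in this table do not have to match the ones `CUDA_VISIBLE_DEVICES` uses. By default CUDA enumerates devices fastest first, while `nvidia-smi` lists them in PCI bus order. That would be consistent with index 0 above being the 8 GB card even though `CUDA_VISIBLE_DEVICES=0` selected the 3080. A minimal sketch of pinning the order, assuming the goal is to address a card by its `nvidia-smi` index (the helper name `select_gpu_by_smi_index` is hypothetical):

```python
import os


def select_gpu_by_smi_index(smi_index, env=None):
    """Return a copy of the environment that exposes a single GPU.

    The GPU is addressed by the index shown in `nvidia-smi`.
    """
    new_env = dict(os.environ if env is None else env)
    # PCI_BUS_ID makes CUDA enumerate devices in the same order as nvidia-smi,
    # instead of the default fastest-first order.
    new_env["CUDA_DEVICE_ORDER"] = "PCI_BUS_ID"
    new_env["CUDA_VISIBLE_DEVICES"] = str(smi_index)
    return new_env


# Usage (not executed here): launch the training script on nvidia-smi GPU 1.
# import subprocess
# subprocess.run(["python", "train_script.py"], env=select_gpu_by_smi_index(1))
```

Both variables must be set before the process initializes CUDA, which is why the sketch passes them to a child process rather than changing them after TensorFlow has been imported.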

Can anyone explain why training fails on the RTX 3080?
