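I have always gotten a lot of help from you all.
I am asking for your help with this error.
Below is my training environment.
[Host]
OS: Ubuntu 20.04
GPU: RTX 3090
Docker version: 20.10.7
[Training code]
Python version: 2.7
TensorFlow: 1.13
I am testing this GitHub code for research: https://github.com/Google-Health/records-research/tree/master/graph-convolutional-transformer
I found that RTX 30xx GPUs require CUDA 11 or later, but the training code needs CUDA 10 to train on the GPU.
So I think using a Docker image is essential.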
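To make the mismatch concrete, here is a minimal sketch of the version check as I understand it. The minimum toolkit versions in the table are my reading of NVIDIA's documentation (an assumption, not something taken from the repository), and `toolkit_supports_gpu` is a hypothetical helper name.

```python
# Minimum CUDA toolkit version with native support for each compute capability.
# Assumption: values are my reading of NVIDIA's docs, not from the repo.
MIN_CUDA_FOR_CAPABILITY = {
    (7, 0): (9, 0),   # Volta
    (7, 5): (10, 0),  # Turing
    (8, 0): (11, 0),  # Ampere (A100)
    (8, 6): (11, 1),  # Ampere (RTX 30xx)
}


def toolkit_supports_gpu(cuda_version, capability):
    """Return True if the toolkit version is new enough for the GPU."""
    return tuple(cuda_version) >= MIN_CUDA_FOR_CAPABILITY[tuple(capability)]


RTX_3090 = (8, 6)
print(toolkit_supports_gpu((10, 0), RTX_3090))  # CUDA 10.0 -> False
print(toolkit_supports_gpu((11, 1), RTX_3090))  # CUDA 11.1 -> True
```

If this table is right, any image built on CUDA 10.x would lack native support for the RTX 3090, whichever TensorFlow 1.x version it carries.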
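[What I tried]
1. Using Docker
I used the Docker images below.
tensorflow/tensorflow:1.13.2-gpu
tensorflow/tensorflow:1.15.0-gpu
nvcr.io/nvidia/tensorflow:20.01-tf1-py2
However, all three Docker images produce the same result (the error below).
Error message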
2021-12-30 08:00:16.577900: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2021-12-30 08:00:25.437865: E tensorflow/stream_executor/cuda/cuda_blas.cc:698] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
File "./train.py", line 70, in <module>
tf.app.run(main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./train.py", line 63, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
return self.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
_, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
raise six.reraise(*original_exc_info)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
run_metadata=run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
return self._sess.run(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
run_metadata_ptr)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
feed_dict_tensor, options, run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
run_metadata)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(3232, 128), b.shape=(128, 128), m=3232, n=128, k=128
[[node graph_convolutional_transformer/dense_2/Tensordot/MatMul (defined at /tf/graph_convolutional_transformer.py:325) ]]
[[node add_9 (defined at /tf/graph_convolutional_transformer.py:758) ]]
Caused by op u'graph_convolutional_transformer/dense_2/Tensordot/MatMul', defined at:
File "./train.py", line 70, in <module>
tf.app.run(main)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
_sys.exit(main(argv))
File "./train.py", line 63, in main
tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
return executor.run()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
return self.run_local()
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
saving_listeners=saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
loss = self._train_model(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
return self._train_model_default(input_fn, hooks, saving_listeners)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
model_fn_results = self._model_fn(features=features, **kwargs)
File "/tf/graph_convolutional_transformer.py", line 792, in model_fn
model, feature_embedder, features, training)
File "/tf/graph_convolutional_transformer.py", line 720, in get_prediction
embeddings, masks[:, :, None], guide, prior_guide, training)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/tf/graph_convolutional_transformer.py", line 325, in call
v = self._layers['V'][i](features)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
outputs = self.call(inputs, *args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/layers/core.py", line 968, in call
outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 3583, in tensordot
ab_matmul = matmul(a_reshape, b_reshape)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 2455, in matmul
a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5333, in mat_mul
name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
return func(*args, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
self._traceback = tf_stack.extract_stack()
InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(3232, 128), b.shape=(128, 128), m=3232, n=128, k=128
[[node graph_convolutional_transformer/dense_2/Tensordot/MatMul (defined at /tf/graph_convolutional_transformer.py:325) ]]
[[node add_9 (defined at /tf/graph_convolutional_transformer.py:758) ]]
nvidia-smi inside Docker
Thu Dec 30 08:06:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.4 |
|-------------------------------+----------------------+----------------------+
| GPU Name Persistence-M| Bus-Id Disp.A | Volatile Uncorr. ECC |
| Fan Temp Perf Pwr:Usage/Cap| Memory-Usage | GPU-Util Compute M. |
| | | MIG M. |
|===============================+======================+======================|
| 0 NVIDIA GeForce ... Off | 00000000:01:00.0 Off | N/A |
| 0% 36C P8 21W / 420W | 19MiB / 24268MiB | 0% Default |
| | | N/A |
+-------------------------------+----------------------+----------------------+
+-----------------------------------------------------------------------------+
| Processes: |
| GPU GI CI PID Type Process name GPU Memory |
| ID ID Usage |
|=============================================================================|
+-----------------------------------------------------------------------------+
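As far as I understand, the "CUDA Version" field in this header is the highest CUDA version the host driver supports, not the toolkit installed in the container. The sketch below pulls both numbers out of the outputs above so they can be compared side by side; `parse_smi_header` is a hypothetical helper, and reading the toolkit version from the `libcublas.so.10.0` file name is my own assumption.

```python
import re


def parse_smi_header(line):
    """Extract the driver version and the max CUDA version the driver supports."""
    match = re.search(
        r"Driver Version:\s*([\d.]+)\s+CUDA Version:\s*([\d.]+)", line
    )
    if match is None:
        raise ValueError("not an nvidia-smi header line")
    return match.group(1), match.group(2)


header = "| NVIDIA-SMI 470.86 Driver Version: 470.86 CUDA Version: 11.4 |"
driver, driver_cuda = parse_smi_header(header)

# Assumption: the suffix of the cuBLAS library named in the error log
# reflects the CUDA toolkit version inside the container.
loaded_lib = "libcublas.so.10.0"
container_cuda = loaded_lib.split(".so.", 1)[1]

print(driver, driver_cuda, container_cuda)  # 470.86 11.4 10.0
```

So the driver side looks fine for CUDA 11, while the library TensorFlow actually loaded is from CUDA 10.0.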
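I also tried these
I searched for the "Blas GEMM" error and found the following solutions:
tensorflow-gpu not working with Blas GEMM launch failed
https://stackoverflow.com/a/65523597/17757583
However, these did not resolve the error...
So I tried another approach, described below.
2. Other approach
I tried this method, but after the setup finished, the only Python version available in the conda env was 3.x. (I need to use Python 2.7.)
Is there any other solution that fixes this error using Docker?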