0

我一直从你们那里得到很多帮助。

我正在寻求您对此错误的帮助。

下面是我的训练环境。



[主机]

操作系统:Ubuntu 20.04

显卡:RTX 3090

码头工人版本:20.10.7


[培训代码]

Python版本:2.7

张量流:1.13

我正在测试这个 github 代码进行研究:https ://github.com/Google-Health/records-research/tree/master/graph-convolutional-transformer


我发现 RTX30XX GPU 需要 CUDA 11 或更高版本,但训练代码需要 CUDA 10 才能使用 gpu 进行训练。

所以,我认为使用 docker image 是必不可少的。



[我试过的]

1. 使用 Docker

我在下面使用了 docker 图像。

  • 张量流/张量流:1.13.2-gpu

  • 张量流/张量流:1.15.0-gpu

  • nvcr.io/nvidia/tensorflow:20.01-tf1-py2

但是,所有三个 docker 图像都会产生相同的结果。(错误)

错误信息

2021-12-30 08:00:16.577900: I tensorflow/stream_executor/dso_loader.cc:152] successfully opened CUDA library libcublas.so.10.0 locally
2021-12-30 08:00:25.437865: E tensorflow/stream_executor/cuda/cuda_blas.cc:698] failed to run cuBLAS routine cublasSgemm_v2: CUBLAS_STATUS_EXECUTION_FAILED
Traceback (most recent call last):
  File "./train.py", line 70, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./train.py", line 63, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1158, in _train_model_default
    saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1407, in _train_with_estimator_spec
    _, loss = mon_sess.run([estimator_spec.train_op, estimator_spec.loss])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 676, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1171, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1270, in run
    raise six.reraise(*original_exc_info)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1255, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1327, in run
    run_metadata=run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/monitored_session.py", line 1091, in run
    return self._sess.run(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 929, in run
    run_metadata_ptr)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1152, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1328, in _do_run
    run_metadata)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 1348, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InternalError: Blas GEMM launch failed : a.shape=(3232, 128), b.shape=(128, 128), m=3232, n=128, k=128
         [[node graph_convolutional_transformer/dense_2/Tensordot/MatMul (defined at /tf/graph_convolutional_transformer.py:325) ]]
         [[node add_9 (defined at /tf/graph_convolutional_transformer.py:758) ]]

Caused by op u'graph_convolutional_transformer/dense_2/Tensordot/MatMul', defined at:
  File "./train.py", line 70, in <module>
    tf.app.run(main)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/platform/app.py", line 125, in run
    _sys.exit(main(argv))
  File "./train.py", line 63, in main
    tf.estimator.train_and_evaluate(estimator, train_spec, eval_spec)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 471, in train_and_evaluate
    return executor.run()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 611, in run
    return self.run_local()
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/training.py", line 712, in run_local
    saving_listeners=saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 358, in train
    loss = self._train_model(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1124, in _train_model
    return self._train_model_default(input_fn, hooks, saving_listeners)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1154, in _train_model_default
    features, labels, model_fn_lib.ModeKeys.TRAIN, self.config)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow_estimator/python/estimator/estimator.py", line 1112, in _call_model_fn
    model_fn_results = self._model_fn(features=features, **kwargs)
  File "/tf/graph_convolutional_transformer.py", line 792, in model_fn
    model, feature_embedder, features, training)
  File "/tf/graph_convolutional_transformer.py", line 720, in get_prediction
    embeddings, masks[:, :, None], guide, prior_guide, training)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/tf/graph_convolutional_transformer.py", line 325, in call
    v = self._layers['V'][i](features)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/engine/base_layer.py", line 554, in __call__
    outputs = self.call(inputs, *args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/keras/layers/core.py", line 968, in call
    outputs = standard_ops.tensordot(inputs, self.kernel, [[rank - 1], [0]])
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 3583, in tensordot
    ab_matmul = matmul(a_reshape, b_reshape)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/math_ops.py", line 2455, in matmul
    a, b, transpose_a=transpose_a, transpose_b=transpose_b, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 5333, in mat_mul
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/op_def_library.py", line 788, in _apply_op_helper
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/util/deprecation.py", line 507, in new_func
    return func(*args, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 3300, in create_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1801, in __init__
    self._traceback = tf_stack.extract_stack()

InternalError (see above for traceback): Blas GEMM launch failed : a.shape=(3232, 128), b.shape=(128, 128), m=3232, n=128, k=128
         [[node graph_convolutional_transformer/dense_2/Tensordot/MatMul (defined at /tf/graph_convolutional_transformer.py:325) ]]
         [[node add_9 (defined at /tf/graph_convolutional_transformer.py:758) ]]

docker中的nvidia-smi

Thu Dec 30 08:06:33 2021
+-----------------------------------------------------------------------------+
| NVIDIA-SMI 470.86       Driver Version: 470.86       CUDA Version: 11.4     |
|-------------------------------+----------------------+----------------------+
| GPU  Name        Persistence-M| Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|                               |                      |               MIG M. |
|===============================+======================+======================|
|   0  NVIDIA GeForce ...  Off  | 00000000:01:00.0 Off |                  N/A |
|  0%   36C    P8    21W / 420W |     19MiB / 24268MiB |      0%      Default |
|                               |                      |                  N/A |
+-------------------------------+----------------------+----------------------+

+-----------------------------------------------------------------------------+
| Processes:                                                                  |
|  GPU   GI   CI        PID   Type   Process name                  GPU Memory |
|        ID   ID                                                   Usage      |
|=============================================================================|
+-----------------------------------------------------------------------------+

我也试过这些

我搜索了错误“Blas GEMM”,得到了以下解决方案:

tensorflow-gpu 不适用于 Blas GEMM 启动失败

https://stackoverflow.com/a/65523597/17757583

cublas的tensorflow运行错误


但是,这些并不能解决错误...

所以,我在下面尝试了其他方法。



2. 其他方法

https://www.pugetsystems.com/labs/hpc/How-To-Install-TensorFlow-1-15-for-NVIDIA-RTX30-GPUs-without-docker-or-CUDA-install-2005/

我试过这个方法,但是设置完成后conda env中的python版本只有3.x。(我应该使用 python 2.7)



是否有任何其他解决方案可以使用 docker 修复此错误?

4

0 回答 0