0

正如文档中所述,我正在尝试在 Common Voice 数据集上训练 DeepSpeech 模型。但它给出了以下错误:

I0421 11:34:32.779112 140581195995008 utils.py:157] NumExpr defaulting to 2 threads.
I Could not find best validating checkpoint.
I Could not find most recent checkpoint.
I Initializing all variables.
Traceback (most recent call last):
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1365, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1348, in _run_fn
    self._extend_graph()
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1388, in _extend_graph
    tf_session.ExtendSession(self._session)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by {{node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams}}with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/content/DeepSpeech/DeepSpeech.py", line 12, in <module>
    ds_train.run_script()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 982, in run_script
    absl.app.run(main)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 303, in run
    _run_main(main, args)
  File "/usr/local/lib/python3.7/dist-packages/absl/app.py", line 251, in _run_main
    sys.exit(main(argv))
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 954, in main
    train()
  File "/content/DeepSpeech/training/deepspeech_training/train.py", line 529, in train
    load_or_init_graph_for_training(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 137, in load_or_init_graph_for_training
    _load_or_init_impl(session, methods, allow_drop_layers=True)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 112, in _load_or_init_impl
    return _initialize_all_variables(session)
  File "/content/DeepSpeech/training/deepspeech_training/util/checkpoints.py", line 88, in _initialize_all_variables
    session.run(v.initializer)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 956, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1180, in _run
    feed_dict_tensor, options, run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1359, in _do_run
    run_metadata)
  File "/usr/local/lib/python3.7/dist-packages/tensorflow_core/python/client/session.py", line 1384, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.InvalidArgumentError: No OpKernel was registered to support Op 'CudnnRNNCanonicalToParams' used by node tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams (defined at usr/local/lib/python3.7/dist-packages/tensorflow_core/python/framework/ops.py:1748) with these attrs: [dropout=0, seed=4568, num_params=8, T=DT_FLOAT, input_mode="linear_input", direction="unidirectional", rnn_mode="lstm", seed2=247]
Registered devices: [CPU, XLA_CPU]
Registered kernels:
  <no registered kernels>

     [[tower_0/cudnn_lstm/cudnn_lstm/CudnnRNNCanonicalToParams]]

我的本地机器规格如下:

蟒蛇3.7;库达 10.1;CuDNN 7.6.5;张量流-GPU 1.15.2;GPU GTX 1050 钛

我还安装了以下包和库来准备环境:

!apt-add-repository universe
!apt-get install sox libsox-fmt-mp3 cmake libblkid-dev e2fslibs-dev libboost-all-dev libaudit-dev libeigen3-dev zlib1g-dev libbz2-dev liblzma-dev
!python3.7 -m pip install sox
!python3.7 -m pip install deepspeech-gpu
!python3.7 -m pip install tensorflow-gpu==1.15.2
!python3.7 -m pip install numpy==1.19.5
!python3.7 -m pip install progressbar2
!python3.7 -m pip install progressbar
!python3.7 -m pip install progressbar33
!python3.7 -m pip install ds_ctcdecoder==0.10.0-alpha.3
!python3.7 -m pip install pyogg==0.6.14a1
!python3.7 -m pip install deepspeech
!git clone --branch v0.9.3 https://github.com/mozilla/DeepSpeech
!python3.7 -m pip install --upgrade --force-reinstall -e ./DeepSpeech/
!git clone https://github.com/kpu/kenlm.git
!mkdir -p build
!cmake kenlm
!make -j 4
!wget https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/deepspeech-0.9.3-checkpoint.tar.gz
!curl -LO "https://github.com/mozilla/DeepSpeech/releases/download/v0.9.3/native_client.amd64.cuda.linux.tar.xz"
!mkdir native_client
!tar xvf native_client.amd64.cuda.linux.tar.xz -C native_client

我在本地机器和 google colab vm 上都遇到了同样的问题。

编辑:我还将我的 cuda 和 cudnn 版本分别更改为 10.0 和 7.5.6。但是错误已经存在。

4

3 回答 3

0

感谢您提供额外的信息 Soroush。

LD_LIBRARY_PATH看起来不错,我将假设这些库实际上位于这些路径中。

接下来,我要确保代码在 GPU 本身上执行。

代码可能无法在 GPU 上执行的原因有很多。您提到您的环境是根据 DeepSpeech PlayBook 设置的,这意味着它使用的是 Docker。那是对的吗?如果是这样,您是使用该gpus -all参数生成的 Docker 容器吗?

接下来要检查的是是否nvtop从 DeepSpeech 报告 GPU 活动。当DeepSpeech.py脚本运行时,这应该会导致高compute负载,在nvtop. 如果你没有看到这个,这意味着代码可能没有在 GPU 上执行,这可以解释No OpKernel错误。

于 2021-04-26T00:13:09.093 回答
0

我已经解决了这个问题。该问题是由 Tensorflow 的版本引起的。正如我之前提到的,我使用了 Tf 1.15.2,而我必须使用 Tf 1.15.4。

于 2021-04-27T08:15:59.027 回答
0

在 DeepSpeech Discourse 上看到过类似的错误,问题是 CUDA 安装。

你的$LD_LIBRARY_PATH环境变量的值是多少?

您可以通过以下方式找到它:

$ echo $LD_LIBRARY_PATH
/usr/lib/x86_64-linux-gnu:/usr/local/cuda/bin:/usr/local/cuda/lib64:/usr/local/cuda-11.2/targets/x86_64-linux/lib

我的怀疑是 CUDA 无法找到正确的库。

于 2021-04-24T00:53:06.373 回答