tensorflow - Converting code from keras to tf.keras causes problems

Question

I am learning machine translation in Keras using the code from this article. The article's code runs fine on both GPU and CPU.

Now I want to take advantage of the Google Colab TPU. The code does not convert to a TPU model as it stands, so I need to move toward TF.

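Following the Fashion MNIST notebook for TPUs, I use the Keras layers that ship inside Tensorflow (tf.keras) rather than standalone Keras. Before getting to the TPU part, I am doing this conversion to see whether the code still runs on the GPU. This mainly means changing this function, from: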

from keras.models import Sequential
from keras.layers import LSTM
from keras.layers import Dense
from keras.layers import Embedding
from keras.layers import RepeatVector
from keras.layers import TimeDistributed
# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = Sequential()
    model.add(Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(LSTM(n_units))
    model.add(RepeatVector(tar_timesteps))
    model.add(LSTM(n_units, return_sequences=True))
    model.add(TimeDistributed(Dense(tar_vocab, activation='softmax')))
    return model

to:

import tensorflow as tf
# define NMT model
def define_model(src_vocab, tar_vocab, src_timesteps, tar_timesteps, n_units):
    model = tf.keras.models.Sequential()
    model.add(tf.keras.layers.Embedding(src_vocab, n_units, input_length=src_timesteps, mask_zero=True))
    model.add(tf.keras.layers.LSTM(n_units))
    model.add(tf.keras.layers.RepeatVector(tar_timesteps))
    model.add(tf.keras.layers.LSTM(n_units, return_sequences=True))
    model.add(tf.keras.layers.TimeDistributed(tf.keras.layers.Dense(tar_vocab, activation='softmax')))
    return model

Then I do:

model = define_model(swh_vocab_size, eng_vocab_size, swh_length, eng_length, 256)
model.compile(optimizer='adam', loss='categorical_crossentropy')
model.fit(trainX, trainY, epochs=1, batch_size=64, validation_data=(testX, testY), callbacks=[checkpoint], verbose=2)
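As an aid to reading the model definition, here is a minimal sketch that traces the output shape of each layer in plain Python, without importing Keras. The helper name trace_shapes and the example sizes for src_timesteps, tar_timesteps, and tar_vocab are made up for illustration; only n_units=256 and batch_size=64 come from the calls above.

```python
def trace_shapes(src_timesteps, tar_timesteps, n_units, tar_vocab, batch_size):
    """Return (layer, output shape) pairs for the Sequential model in define_model.

    Hypothetical helper: it only mirrors the layer order, it does not build a model.
    The input to the first layer has shape (batch_size, src_timesteps).
    """
    return [
        # Embedding maps each integer token to an n_units-dimensional vector.
        ("Embedding", (batch_size, src_timesteps, n_units)),
        # The encoder LSTM returns only its final output (return_sequences=False).
        ("LSTM (encoder)", (batch_size, n_units)),
        # RepeatVector copies that vector once per target timestep.
        ("RepeatVector", (batch_size, tar_timesteps, n_units)),
        # The decoder LSTM returns one output per timestep (return_sequences=True).
        ("LSTM (decoder)", (batch_size, tar_timesteps, n_units)),
        # The Dense softmax is applied independently at every timestep.
        ("TimeDistributed(Dense)", (batch_size, tar_timesteps, tar_vocab)),
    ]


# Example sizes: 10, 8 and 2000 are invented; 256 and 64 match the calls above.
for layer, shape in trace_shapes(10, 8, 256, 2000, 64):
    print(layer, shape)
```

Because the loss is categorical_crossentropy, trainY is expected to be one-hot encoded with the same trailing dimensions as the last layer, that is (samples, tar_timesteps, tar_vocab).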

However, when I run it, this produces a warning:

lib\site-packages\tensorflow\python\ops\gradients_impl.py:112: UserWarning: Converting sparse IndexedSlices to a dense Tensor of unknown shape. This may consume a large amount of memory.
  "Converting sparse IndexedSlices to a dense Tensor of unknown shape. "

Then, during the fit on the GPU, it fails on a BLAS call as follows:

InternalError: Blas GEMM launch failed : a.shape=(64, 256), b.shape=(256, 256), m=64, n=256, k=256
     [[{{node lstm/while/MatMul}} = MatMul[T=DT_FLOAT, _class=["loc:@training/Adam/gradients/lstm/while/strided_slice_grad/StridedSliceGrad"], transpose_a=false, transpose_b=false, _device="/job:localhost/replica:0/task:0/device:GPU:0"](lstm/while/TensorArrayReadV3, lstm/while/strided_slice)]]
     [[{{node loss/time_distributed_loss/broadcast_weights/assert_broadcastable/AssertGuard/Assert/Switch/_175}} = _Recv[client_terminated=false, recv_device="/job:localhost/replica:0/task:0/device:CPU:0", send_device="/job:localhost/replica:0/task:0/device:GPU:0", send_device_incarnation=1, tensor_name="edge_2728_...ert/Switch", tensor_type=DT_BOOL, _device="/job:localhost/replica:0/task:0/device:CPU:0"]()]]

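This is before the conversion to a TPU model. I just wanted to make sure that things still run on CPU and GPU before doing the final TPU conversion. They do not. Any thoughts on why I can't get this far?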
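One way to read the InternalError is to check whether the reported operand shapes are even compatible. The sketch below is plain Python with a hypothetical helper name, gemm_output_shape; it only reproduces the shape rule for a 2-D matrix product and does not touch the GPU.

```python
def gemm_output_shape(a_shape, b_shape):
    """Return the shape of a @ b for 2-D operands.

    Raises ValueError when the inner dimensions (k) do not match.
    """
    m, k_a = a_shape
    k_b, n = b_shape
    if k_a != k_b:
        raise ValueError(
            "inner dimensions differ: {} vs {}".format(k_a, k_b)
        )
    return (m, n)


# Shapes copied from the error message: a.shape=(64, 256), b.shape=(256, 256).
print(gemm_output_shape((64, 256), (256, 256)))  # (64, 256)
```

Here m=64 lines up with batch_size=64 and n=k=256 with n_units=256, so the operands are consistent with the model. That suggests the failure happens when the GEMM kernel is launched on the GPU rather than in the converted model code, although this check alone cannot confirm the cause.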

score 0 · Accepted Answer

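I think some of this may come down to installing Anaconda Python on Windows carefully. Here is the order that I believe is correct (assuming you have already installed CUDA 9.0 and cuDNN):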

Per this question, install the version of Visual Studio that matches the one used to build tensorflow. Add the path

C:\Program Files (x86)\Microsoft Visual Studio 14.0\VC

to PATH.

Per this: run vcvarsall in a script before running Python. Then:

Start a CMD window using Run as administrator. This is critical.
conda create --name myenv
conda activate myenv
conda install tensorflow-gpu
conda install mingw
conda install libpython
conda install mkl-service

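I will mark this as correct later, after doing more testing. Steps 3 and 4 come from this question, and the idea of starting from scratch and strictly using conda install rather than pip install comes from this question.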
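After running the install steps, a quick sanity check from inside the activated environment could look like the sketch below. It uses only the standard library. The helper name missing_modules is hypothetical, and the import names are assumptions: tensorflow-gpu is taken to install the module tensorflow and mkl-service the module mkl, while mingw and libpython are not checked here because they are not assumed to provide importable modules.

```python
import importlib.util


def missing_modules(names):
    """Return the top-level module names in `names` that cannot be located.

    find_spec only looks the module up; it does not import or run it.
    """
    return [name for name in names if importlib.util.find_spec(name) is None]


# Assumed import names for the conda packages installed above.
print(missing_modules(["tensorflow", "mkl"]))
```

An empty list means both modules were found in the current environment. It does not show that the GPU build is working, only that the packages are visible to this Python interpreter.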
