tensorflow - How to continue training after loading a model on multiple GPUs in TensorFlow 2.0 with the Keras API?

Question

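I trained a text classification model containing an RNN in TensorFlow 2.0 with the Keras API. I trained this model on multiple GPUs (2) using tf.distribute.MirroredStrategy(), and I saved a checkpoint of the model after every epoch using tf.keras.callbacks.ModelCheckpoint('file_name.h5'). Now I want to continue training from the last saved checkpoint, using the same number of GPUs. When I load the checkpoint inside tf.distribute.MirroredStrategy() like this -

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = tf.keras.models.load_model('file_name.h5')

it raises the following error.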

File "model_with_tfsplit.py", line 94, in <module>
    model =tf.keras.models.load_model('TF_model_onfull_2_03.h5') # Loading for retraining
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/save.py", line 138, in load_model
    return hdf5_format.load_model_from_hdf5(filepath, custom_objects, compile)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/saving/hdf5_format.py", line 187, in load_model_from_hdf5
    model._make_train_function()
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/engine/training.py", line 2015, in _make_train_function
    params=self._collected_trainable_weights, loss=self.total_loss)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 500, in get_updates
    grads = self.get_gradients(loss, params)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/keras/optimizer_v2/optimizer_v2.py", line 391, in get_gradients
    grads = gradients.gradients(loss, params)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_impl.py", line 158, in gradients
    unconnected_gradients)
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/ops/gradients_util.py", line 541, in _GradientsHelper
    for x in xs
  File "/home/rishabh/.local/lib/python2.7/site-packages/tensorflow_core/python/distribute/values.py", line 716, in handle
    raise ValueError("`handle` is not available outside the replica context"
ValueError: `handle` is not available outside the replica context or a `tf.distribute.Strategy.update()` call

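I am not sure where the problem is. Also, if I load the model without the mirrored strategy, training starts from the beginning, but after a few steps it reaches the same accuracy and loss values it had before the model was saved. I am not sure whether this behavior is normal.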

Thank you! Rishabh Sahrawat

score 1 · Accepted Answer

Create the model under the distribution strategy scope and then use the load_weights method. In this example, get_model returns an instance of tf.keras.Model.

def get_model():
    ...
    return model

mirrored_strategy = tf.distribute.MirroredStrategy()
with mirrored_strategy.scope():
    model = get_model()
    model.load_weights('file_name.h5')
    model.compile(...)
model.fit(...)
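
To also make the epoch counter pick up where the last run stopped, the pattern above can be combined with the initial_epoch argument of model.fit. The following is only a rough sketch under some assumptions that are not in the question: it assumes the checkpoints were written with an epoch number in the file name, e.g. ModelCheckpoint('ckpt_{epoch:02d}.h5'), and the names find_latest_checkpoint, resume_training, compile_kwargs and fit_args are hypothetical helpers made up for illustration.

```python
import os
import re

# Assumption: checkpoints are named ckpt_01.h5, ckpt_02.h5, ... as produced by
# ModelCheckpoint('ckpt_{epoch:02d}.h5'). The question used one fixed file name.
CHECKPOINT_RE = re.compile(r"^ckpt_(\d+)\.h5$")


def find_latest_checkpoint(directory):
    """Return (path, epoch) of the highest-numbered checkpoint, or (None, 0)."""
    best_path, best_epoch = None, 0
    for name in os.listdir(directory):
        match = CHECKPOINT_RE.match(name)
        if match and int(match.group(1)) > best_epoch:
            best_epoch = int(match.group(1))
            best_path = os.path.join(directory, name)
    return best_path, best_epoch


def resume_training(get_model, compile_kwargs, fit_args, directory, total_epochs):
    """Rebuild the model under the strategy scope and continue training."""
    # Imported lazily so the file-name helper above needs only the stdlib.
    import tensorflow as tf

    path, last_epoch = find_latest_checkpoint(directory)
    strategy = tf.distribute.MirroredStrategy()
    with strategy.scope():
        model = get_model()              # hypothetical builder, as in the answer
        if path is not None:
            model.load_weights(path)     # restores layer weights only
        model.compile(**compile_kwargs)
    # initial_epoch makes the epoch numbering continue after the last saved epoch.
    return model.fit(*fit_args, epochs=total_epochs, initial_epoch=last_epoch)
```

For example, if the newest file is ckpt_03.h5, then resume_training(get_model, dict(optimizer='adam', loss='binary_crossentropy'), (x, y), 'checkpoints', total_epochs=10) would continue with epoch 4 of 10. Note that load_weights on an .h5 file restores only the layer weights, so the optimizer starts from a fresh state.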

score 1 · Accepted Answer

I solved it in a similar way to @Srihari Humbarwadi, except that I moved the strategy scope inside the get_model function. The TF documentation describes it similarly:

def get_model(strategy):
    # Build the model inside the strategy scope so its variables are mirrored
    with strategy.scope():
        ...
    return model

and call it before training:

strategy = tf.distribute.MirroredStrategy()
model = get_model(strategy)
model.load_weights('file_name.h5')

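Unfortunately, just calling

model = tf.keras.models.load_model('file_name.h5')

does not enable multi-GPU training. My guess is that it has to do with the .h5 model format. Maybe it works with TensorFlow's native .pb format.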
