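I see the following error when I run model.fit with the Horovod callbacks. If I leave the callbacks out, model.fit runs fine. Note: I am using the horovod.tensorflow.keras package, and my model is built on tensorflow.keras (I am not using the standalone keras package directly, but the one that ships with tensorflow).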
FailedPreconditionError: Error while reading resource variable conv1d/kernel from Container: localhost. This could mean that the variable was uninitialized. Not found: Resource localhost/conv1d/kernel/N10tensorflow3VarE does not exist.
[[{{node conv1d/conv1d/ExpandDims_1/ReadVariableOp}}]]
The callbacks are defined as follows:
import datetime

import tensorflow as tf
import horovod.tensorflow.keras as hvd
from tensorflow.keras.callbacks import TensorBoard

callbacks = [
    # Horovod: broadcast initial variable states from rank 0 to all other processes.
    # This is necessary to ensure consistent initialization of all workers when
    # training is started with random weights or restored from a checkpoint.
    hvd.callbacks.BroadcastGlobalVariablesCallback(0),
    # Report logs to TensorBoard.
    TensorBoard(log_dir='boardlogs/{}'.format(datetime.datetime.now())),
]
# Horovod: save checkpoints only on worker 0 to prevent other workers from corrupting them.
if hvd.rank() == 0:
    callbacks.append(tf.keras.callbacks.ModelCheckpoint('.horovod-cps/checkpoint-{epoch}.h5'))

history = model.fit(X, y, epochs=500, batch_size=64, callbacks=callbacks,
                    verbose=1 if hvd.rank() == 0 else 0)
Environment:
Framework: tensorflow.keras
TensorFlow version: 1.13.1
Keras version: 2.2.4-tf
Horovod version: horovod==0.17.0.post1
Python version: 3.6