I'm trying to run hvd.allreduce(loss) on the tensors that feed TensorBoard's summary_op.
self.avg_loss = hvd.allreduce(self.loss)
self.auc, self.auc_update_op = tf.metrics.auc(
    labels=self.label,
    predictions=self.sigmoid_prediction,
    name=keys.AUC,
    summation_method='careful_interpolation',
)
self.avg_auc = hvd.allreduce(self.auc)
tf.summary.scalar("loss", self.avg_loss)
tf.summary.scalar("auc", self.avg_auc)
self.summary_op = tf.summary.merge_all()
hooks = [
    tf.train.StopAtStepHook(last_step=self.steps_per_epoch * args.num_epochs),
    tf.train.LoggingTensorHook({
        'step': self.global_step,
        'loss': self.loss,
        'auc': self.auc,
    }, every_n_iter=100),
    tf.train.LoggingTensorHook({
        'auc_update_op': self.auc_update_op,
    }, formatter=lambda _: "...", every_n_iter=100),
    tf.train.NanTensorHook(self.loss),
    tf.train.SummarySaverHook(
        save_steps=100,
        output_dir=args.tensorboard_dir if hvd.rank() == 0 else None,
        summary_op=self.summary_op,
    ),
]
with tf.train.MonitoredTrainingSession(
        config=config,
        save_checkpoint_secs=60,
        save_summaries_steps=None,
        save_summaries_secs=None,
        checkpoint_dir=args.checkpoint if hvd.rank() == 0 else None,
        hooks=hooks) as session:
    while not session.should_stop():
        session.run(self.train_op)
But I keep hitting this error:

One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.
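For context on what the message means: hvd.allreduce is a collective op, so every rank must submit the same tensors, in the same order, the same number of times. The warning fires when the submission counts diverge, e.g. if a summary op that contains an allreduce ends up being evaluated on only some ranks or at different steps. Below is a toy counting model of that failure mode, assuming a hypothetical setup where one allreduce runs every training step and one more runs inside the summary op; it uses no Horovod and only illustrates the bookkeeping, not the real runtime.

```python
def allreduce_submissions(rank, num_steps, summary_every, summary_on_rank_0_only):
    """Count how many collective (allreduce) submissions a rank makes.

    Hypothetical model: one allreduce per training step on every rank,
    plus one allreduce inside summary_op whenever summaries are saved.
    """
    submissions = 0
    for step in range(1, num_steps + 1):
        submissions += 1  # allreduce run as part of the training step
        if step % summary_every == 0:
            if not summary_on_rank_0_only or rank == 0:
                submissions += 1  # allreduce hidden inside summary_op
    return submissions

# Buggy pattern: summary_op (and its allreduce) is evaluated on rank 0 only.
r0 = allreduce_submissions(0, 200, 100, summary_on_rank_0_only=True)
r1 = allreduce_submissions(1, 200, 100, summary_on_rank_0_only=True)
print(r0, r1)  # 202 vs 200 -> rank 0 stalls waiting for rank 1

# Safe pattern: every rank evaluates summary_op at the same steps; ranks
# other than 0 simply discard the serialized summary instead of skipping it.
f0 = allreduce_submissions(0, 200, 100, summary_on_rank_0_only=False)
f1 = allreduce_submissions(1, 200, 100, summary_on_rank_0_only=False)
print(f0, f1)  # 202 vs 202 -> counts match, no stall
```

The takeaway from the toy model is that whatever hook or fetch triggers the allreduce-bearing summary op must fire identically on all ranks; only the *writing* of the summary should be restricted to rank 0.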