I am trying to apply hvd.allreduce to the loss (and the AUC) that feed the summary_op I write to TensorBoard.

self.avg_loss = hvd.allreduce(self.loss)
self.auc, self.auc_update_op = tf.metrics.auc(
        labels=self.label,
        predictions=self.sigmoid_prediction,
        name=keys.AUC,
        summation_method='careful_interpolation',
    )
self.avg_auc = hvd.allreduce(self.auc)

tf.summary.scalar("loss", self.avg_loss)
tf.summary.scalar("auc", self.avg_auc)
self.summary_op = tf.summary.merge_all()


hooks = [
    tf.train.StopAtStepHook(last_step=self.steps_per_epoch * args.num_epochs),
    tf.train.LoggingTensorHook({
        'step': self.global_step,
        'loss': self.loss,
        'auc': self.auc
    }, every_n_iter=100),
    tf.train.LoggingTensorHook({
        'auc_update_op': self.auc_update_op,
    }, formatter=lambda _: "...", every_n_iter=100),
    tf.train.NanTensorHook(self.loss),
    tf.train.SummarySaverHook(
        save_steps=100,
        output_dir=args.tensorboard_dir if hvd.rank() == 0 else None,
        summary_op=self.summary_op,
    ),
]

with tf.train.MonitoredTrainingSession(
        config=config,
        save_checkpoint_secs=60,
        save_summaries_steps=None,
        save_summaries_secs=None,
        checkpoint_dir=args.checkpoint if hvd.rank() == 0 else None,
        hooks=hooks) as session:
    while not session.should_stop():
        session.run(self.train_op)

But I keep running into this error:

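One or more tensors were submitted to be reduced, gathered or broadcasted by subset of ranks and are waiting for remainder of ranks for more than 60 seconds. This may indicate that different ranks are trying to submit different tensors or that only subset of ranks is submitting tensors, which will cause deadlock.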
