tensorflow - Hyperparameter metric tag not printed after job completes on AI Platform (Python 3.7, TF 2.1)

Question

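Previously, when we ran the same code with runtime version 1.15 and TensorFlow 1.15, the jobs executed and reported accuracy, because a hyperparameter metric tag was passed when creating the AI Platform job.

The model runs, and the loss and accuracy are printed in the logs, but the hyperparameter tuning job still fails.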

    # Excerpt from model_fn: training branch
    return tf.estimator.EstimatorSpec(
        mode=mode, loss=loss, train_op=train_op,
        training_hooks=[get_logging_hooks(loss),
                        get_logging_hooks_accuracy(accuracy[1])])

    # Excerpt from model_fn: evaluation branch
    eval_metric_ops = {
        'accuracy': accuracy, 'f1_score': f1_score, 'recall': recall,
        'precision': precision}

    return tf.estimator.EstimatorSpec(mode=mode,
                                      loss=loss,
                                      eval_metric_ops=eval_metric_ops)


custom_estimator_model = tf.estimator.Estimator(
        model_fn=self.model_fn(), model_dir=model_dir,
        config=get_run_config(self.strategy))

train_spec = tf.estimator.TrainSpec(input_fn=self.input_fn,
                                    max_steps=self.train_steps)
assert (self.eval_input_fn), "Please provide eval input function"
eval_spec = tf.estimator.EvalSpec(input_fn=self.eval_input_fn,
                                  steps=self.eval_steps,
                                  exporters=self.exporters,
                                  throttle_secs=self.eval_throttle_secs)
tf.estimator.train_and_evaluate(custom_estimator_model,
                                train_spec,
                                eval_spec)

The configuration passed for hyperparameter tuning is shown below:

config['trainingInput']['hyperparameters'] = {
    'goal': 'MAXIMIZE',
    'hyperparameterMetricTag': 'accuracy',
    'maxTrials': self.max_trials,
    'maxParallelTrials': 2,
    # 'maxFailedTrials': 1,
    'enableTrialEarlyStopping': False,
    'params': [
        {'parameterName': 'batch-size',
         'type': 'DISCRETE',
         'discreteValues': self.batch_size},
        {'parameterName': 'learning-rate',
         'type': 'DISCRETE',
         'discreteValues': self.learning_rate},
    ],
}
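
For context, AI Platform Training passes each trial's hyperparameter values to the trainer as command-line flags named after `parameterName`, so the two parameters above would arrive as `--batch-size` and `--learning-rate`. A minimal sketch of the receiving side, assuming an argparse-based entry point (the function name `parse_trial_args` and the default values are hypothetical and not taken from the code in this question):

```python
import argparse


def parse_trial_args(argv=None):
    """Parse the hyperparameter flags that a tuning trial passes to the trainer."""
    parser = argparse.ArgumentParser()
    # Flag names must match the 'parameterName' entries in the tuning config.
    # DISCRETE values are numeric; accept '64' or '64.0' defensively (assumption).
    parser.add_argument('--batch-size', type=lambda s: int(float(s)), default=32)
    parser.add_argument('--learning-rate', type=float, default=1e-3)
    # parse_known_args ignores any other flags passed to the trainer,
    # such as --job-dir.
    args, _unknown = parser.parse_known_args(argv)
    return args


if __name__ == '__main__':
    args = parse_trial_args(['--batch-size=64', '--learning-rate=0.01'])
    # argparse exposes dashed flags with underscores in the attribute name.
    print(args.batch_size, args.learning_rate)
```

If the flag names in the trainer do not match the `parameterName` values, the trial would silently train with the defaults, which is worth ruling out when comparing runs across runtime versions.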

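On TensorFlow 2.1 and Python 3.7, launching the job with the same code as before causes the hyperparameter tuning job to fail.

There are no error logs, so I have not pasted any (the logs in Stackdriver show the job as having completed successfully).

In the logs I can see that the model is training (epochs are running, with loss and accuracy values logged), but after all epochs have finished, every HyperTune trial shows the status "Failed" and the metric tag is not printed to the console. The evaluation step also does not run for the hyperparameter tuning job.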
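
One thing that may be worth trying, since the trials finish without a reported metric, is to report the evaluation result explicitly rather than relying on the service to pick it up from the Estimator's summaries. The sketch below assumes the `cloudml-hypertune` package is installed in the trainer and that `eval_result` is the metrics dictionary returned by `Estimator.evaluate`. The helper name `report_metric` and its `reporter` parameter are hypothetical, added so the function can be exercised without the package.

```python
def report_metric(eval_result, tag='accuracy', reporter=None):
    """Report one evaluation metric to the hyperparameter tuning service.

    eval_result: dict of metric name -> value, e.g. from Estimator.evaluate().
    tag: must equal 'hyperparameterMetricTag' in the tuning config.
    reporter: object exposing report_hyperparameter_tuning_metric(); if None,
        a hypertune.HyperTune() instance is created (assumes cloudml-hypertune).
    """
    if tag not in eval_result:
        raise KeyError('metric %r not in eval result: %s'
                       % (tag, sorted(eval_result)))
    if reporter is None:
        import hypertune  # provided by the cloudml-hypertune package
        reporter = hypertune.HyperTune()
    reporter.report_hyperparameter_tuning_metric(
        hyperparameter_metric_tag=tag,
        metric_value=float(eval_result[tag]),
        global_step=int(eval_result.get('global_step', 0)))
```

It could be called after training, for example with the dictionary returned by `custom_estimator_model.evaluate(input_fn=self.eval_input_fn, steps=self.eval_steps)`. Whether this resolves the failed trials on runtime version 2.1 is not confirmed here.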
