tensorflow - TensorFlow: number of generated checkpoint files does not match the configuration passed to model_main_tf2.py

Question

I am doing object detection in Colab (TensorFlow v2.7.0) using a pre-trained model from the TensorFlow 2 Detection Model Zoo.

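The (new) dataset contains 255 images for training. train_config > batch_size in pipeline.config is 8. So I planned to save a checkpoint once per epoch (hence checkpoint_every_n: 255/8 = ~32) and to train for 100 epochs; hence num_train_steps is 3200. I therefore assumed that 100 checkpoint files would be generated.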
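The arithmetic behind these numbers can be written out as a small sketch. The variable names below are illustrative only and are not part of the Object Detection API:

```python
import math

# Values taken from the question; the names are hypothetical.
num_images = 255
batch_size = 8
epochs = 100

# 255 / 8 = 31.875, so one pass over the data needs 32 steps.
steps_per_epoch = math.ceil(num_images / batch_size)
num_train_steps = steps_per_epoch * epochs

# Expectation in the question: one checkpoint per epoch.
expected_checkpoints = num_train_steps // steps_per_epoch

print(steps_per_epoch, num_train_steps, expected_checkpoints)
```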

!python model_main_tf2.py \
  --pipeline_config_path="./models/pipeline.config" \
  --model_dir="./models" \
  --checkpoint_every_n=32 \
  --num_train_steps=3200 \
  --alsologtostderr

However, after training there are only 7 checkpoint files. Here is a snapshot from the tree /F tool on the Windows command line.

Am I missing something (e.g., an additional configuration somewhere)? Is my assumption above correct? Or is this simply a bug?

score 0 · Accepted Answer

In the file model_main_tf2.py, the main loop is:

with strategy.scope():
      model_lib_v2.train_loop(
          pipeline_config_path=FLAGS.pipeline_config_path,
          model_dir=FLAGS.model_dir,
          train_steps=FLAGS.num_train_steps,
          use_tpu=FLAGS.use_tpu,
          checkpoint_every_n=FLAGS.checkpoint_every_n,
          record_summaries=FLAGS.record_summaries)

Looking at model_lib_v2.train_loop(), there is a default argument:

def train_loop(
    pipeline_config_path,
    model_dir,
    config_override=None,
    train_steps=None,
    use_tpu=False,
    save_final_config=False,
    checkpoint_every_n=1000,
    checkpoint_max_to_keep=7, # Here!
    record_summaries=True,

That is why only 7 checkpoint files are kept. They should be the 7 most recent ones.
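The pruning effect of a "keep at most N" limit can be illustrated with a minimal pure-Python sketch. This is a simplified stand-in, not the actual checkpoint manager code; the helper name keep_latest is hypothetical, and the ckpt-N naming simply mirrors the file names mentioned in this answer:

```python
from collections import deque

def keep_latest(num_saves, max_to_keep=7):
    """Return the checkpoint names that survive after num_saves saves."""
    # A bounded deque drops its oldest entry once it is full,
    # which mimics deleting the oldest checkpoint on each new save.
    kept = deque(maxlen=max_to_keep)
    for i in range(1, num_saves + 1):
        kept.append(f"ckpt-{i}")
    return list(kept)

survivors = keep_latest(10)
print(survivors)  # ckpt-4 through ckpt-10
```

However many saves happen during training, only the last 7 names remain in the list.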

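In addition, the checkpoint_every_n argument is not strictly honored. It is affected by NUM_STEPS_PER_ITERATION, which is hard-coded to 100 (it cannot be changed from outside). In the file model_lib_v2.py, in the function train_loop(), see these lines: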

if ((int(global_step.value()) - checkpointed_step) >=
              checkpoint_every_n):
    manager.save() # Here!
    checkpointed_step = int(global_step.value())

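...which means: each iteration advances 100 steps, and the check against checkpoint_every_n only runs once per iteration. I set num_train_steps to 3200, so a checkpoint is created every 100 steps, for 32 files in total. Adding the one checkpoint file saved at the very beginning, we end up with ckpt-33, as shown in the snapshot.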
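The counting above can be checked with a small simulation. This is only a sketch of the behavior described in this answer (one initial save, then one check per 100-step iteration), not the real train_loop(); the function name simulate_saves is hypothetical:

```python
NUM_STEPS_PER_ITERATION = 100  # hard-coded value, per the answer above

def simulate_saves(train_steps, checkpoint_every_n):
    """Count how many times a checkpoint would be saved."""
    global_step = 0
    saves = 1  # assumed: one checkpoint is written at the very beginning
    checkpointed_step = global_step
    for _ in range(0, train_steps, NUM_STEPS_PER_ITERATION):
        # One iteration runs 100 training steps before the check happens.
        global_step += NUM_STEPS_PER_ITERATION
        if global_step - checkpointed_step >= checkpoint_every_n:
            saves += 1
            checkpointed_step = global_step
    return saves

total = simulate_saves(train_steps=3200, checkpoint_every_n=32)
print(total)  # 33

# With at most 7 checkpoints kept, the survivors are the last 7 saves.
remaining = [f"ckpt-{i}" for i in range(total - 6, total + 1)]
print(remaining[0], remaining[-1])  # ckpt-27 ckpt-33
```

Under these assumptions, any checkpoint_every_n of 100 or less gives the same result, because the condition is already satisfied after every iteration.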
