python - Need to start training twice to load a checkpoint (it works, but why?)

Question

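I am modifying the deeplab network. I added a node to the first layer of the mobilenet-v3 feature extractor that reuses the existing variables. Since no extra parameters are needed, in theory I should be able to load the old checkpoint.

Here is the situation I cannot understand:

When I start training in a new, empty folder and load the checkpoint like this: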

# other parameters unchanged (omitted here)
python "${WORK_DIR}"/train.py \
  --train_logdir="${EXP_DIR}/train" \
  --fine_tune_batch_norm=true \
  --tf_initial_checkpoint="init/deeplab/model.ckpt"

I get an error:

ValueError: Total size of new array must be unchanged for MobilenetV3/Conv/BatchNorm/gamma lh_shape: [(16,)], rh_shape: [(480,)]

However, if I start training in a new, empty folder without loading any checkpoint:

# other parameters unchanged (omitted here)
# --tf_initial_checkpoint="init/deeplab/model.ckpt" is left out, i.e. no checkpoint
python "${WORK_DIR}"/train.py \
  --train_logdir="${EXP_DIR}/train" \
  --fine_tune_batch_norm=false

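training starts without problems.

What confuses me even more is that if I then start training with the checkpoint in that same folder (the train_logdir of the run that loaded no checkpoint), training also starts without any error: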

# same command as the first code block; other parameters unchanged (omitted here)
python "${WORK_DIR}"/train.py \
  --train_logdir="${EXP_DIR}/train" \
  --fine_tune_batch_norm=true \
  --tf_initial_checkpoint="init/deeplab/model.ckpt"

How can this happen? Can --train_logdir somehow store the shapes of the batch normalization parameters from the previous training run?

score 0 · Accepted Answer

I found the following code in train_utils.py (line 203):

    if tf.train.latest_checkpoint(train_logdir):
        tf.logging.info('Ignoring initialization; other checkpoint exists')
        return None

    tf.logging.info('Initializing model from path: %s', tf_initial_checkpoint)

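Before trying to load the checkpoint given by the "tf_initial_checkpoint" flag, it checks for an existing checkpoint in train_logdir. If one exists, the initial checkpoint is ignored and training continues from the checkpoint in train_logdir.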
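The precedence can be replayed without TensorFlow. The following is a minimal sketch, not the DeepLab code: `pick_init_source` and `simulate_three_runs` are hypothetical helpers, and the sketch assumes that a folder "has a checkpoint" whenever it contains the `checkpoint` index file that TensorFlow savers write, which simplifies what `tf.train.latest_checkpoint` actually does.

```python
import os
import tempfile


def pick_init_source(train_logdir, tf_initial_checkpoint):
    """Return which source a run would start from (simplified sketch).

    Assumption: a `checkpoint` index file in train_logdir stands in for
    tf.train.latest_checkpoint(train_logdir) returning a path.
    """
    if os.path.exists(os.path.join(train_logdir, "checkpoint")):
        # Mirrors the 'Ignoring initialization; other checkpoint exists' branch.
        return "train_logdir"
    if tf_initial_checkpoint:
        return "tf_initial_checkpoint"
    return "random_init"


def simulate_three_runs(initial_checkpoint="init/deeplab/model.ckpt"):
    """Replay the three runs from the question in a temporary folder."""
    results = []
    with tempfile.TemporaryDirectory() as logdir:
        # Run 1: empty folder, initial checkpoint given (the run that raised ValueError).
        results.append(pick_init_source(logdir, initial_checkpoint))
        # Run 2: empty folder, no initial checkpoint.
        results.append(pick_init_source(logdir, None))
        # Pretend run 2 saved a checkpoint into the folder.
        with open(os.path.join(logdir, "checkpoint"), "w") as f:
            f.write('model_checkpoint_path: "model.ckpt-0"\n')
        # Run 3: same folder, initial checkpoint given but ignored.
        results.append(pick_init_source(logdir, initial_checkpoint))
    return results


print(simulate_three_runs())
```

Run as a script, this prints `['tf_initial_checkpoint', 'random_init', 'train_logdir']`: the third run never touches the pretrained checkpoint, which is why it starts without the shape error.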

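So when I started training the second time, the network loaded the variables saved by the first run, which have nothing to do with my pretrained checkpoint.

My experiments also show that starting training twice, as I did, does not give results as good as when the pretrained checkpoint is loaded properly.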
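To load the pretrained checkpoint properly, the shape mismatch reported by the original ValueError still has to be tracked down. Below is a hedged sketch for listing every variable whose shape differs between the model and a checkpoint. `find_shape_mismatches` and both dictionaries are hypothetical; only the `gamma` shapes come from the error message, and treating `(16,)` as the checkpoint side and `(480,)` as the model side is an assumption. With TensorFlow installed, the checkpoint side could be filled from `tf.train.list_variables`, which returns `(name, shape)` pairs.

```python
def find_shape_mismatches(model_shapes, checkpoint_shapes):
    """Return {name: (model_shape, checkpoint_shape)} for variables that
    exist on both sides but disagree in shape."""
    mismatches = {}
    for name, model_shape in model_shapes.items():
        ckpt_shape = checkpoint_shapes.get(name)
        if ckpt_shape is not None and tuple(ckpt_shape) != tuple(model_shape):
            mismatches[name] = (tuple(model_shape), tuple(ckpt_shape))
    return mismatches


# Hypothetical inputs: the gamma shapes are from the error message,
# the beta entry is invented to show a matching variable.
checkpoint_shapes = {
    "MobilenetV3/Conv/BatchNorm/gamma": (16,),
    "MobilenetV3/Conv/BatchNorm/beta": (16,),
}
model_shapes = {
    "MobilenetV3/Conv/BatchNorm/gamma": (480,),
    "MobilenetV3/Conv/BatchNorm/beta": (16,),
}

for name, (in_model, in_ckpt) in find_shape_mismatches(model_shapes, checkpoint_shapes).items():
    print(f"{name}: model {in_model} vs checkpoint {in_ckpt}")
```

A mismatch on a variable that was meant to be reused unchanged suggests that the added node is creating or sharing a variable under that name with a different size.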
