
I'm trying to use batch normalization via tf.layers.batch_normalization(), and my code looks like this:

def create_conv_exp_model(fingerprint_input, model_settings, is_training):


  # Dropout placeholder
  if is_training:
    dropout_prob = tf.placeholder(tf.float32, name='dropout_prob')

  # Mode placeholder
  mode_placeholder = tf.placeholder(tf.bool, name="mode_placeholder")

  he_init = tf.contrib.layers.variance_scaling_initializer(mode="FAN_AVG")

  # Input Layer
  input_frequency_size = model_settings['bins']
  input_time_size = model_settings['spectrogram_length']
  net = tf.reshape(fingerprint_input,
                   [-1, input_time_size, input_frequency_size, 1],
                   name="reshape")
  net = tf.layers.batch_normalization(net, 
                                      training=mode_placeholder,
                                      name='bn_0')

  for i in range(1, 6):
    net = tf.layers.conv2d(inputs=net,
                           filters=8*(2**i),
                           kernel_size=[5, 5],
                           padding='same',
                           kernel_initializer=he_init,
                           name="conv_%d"%i)
    net = tf.layers.batch_normalization(net,
                                        training=mode_placeholder,
                                        name='bn_%d'%i)
    with tf.name_scope("relu_%d"%i):
      net = tf.nn.relu(net)
    net = tf.layers.max_pooling2d(net, [2, 2], [2, 2], 'SAME', 
                                  name="maxpool_%d"%i)

  net_shape = net.get_shape().as_list()
  net_height = net_shape[1]
  net_width = net_shape[2]
  net = tf.layers.conv2d( inputs=net,
                          filters=1024,
                          kernel_size=[net_height, net_width],
                          strides=(net_height, net_width),
                          padding='same',
                          kernel_initializer=he_init,
                          name="conv_f")
  net = tf.layers.batch_normalization( net, 
                                        training=mode_placeholder,
                                        name='bn_f')
  with tf.name_scope("relu_f"):
    net = tf.nn.relu(net)

  net = tf.layers.conv2d( inputs=net,
                          filters=model_settings['label_count'],
                          kernel_size=[1, 1],
                          padding='same',
                          kernel_initializer=he_init,
                          name="conv_l")

  ### Squeeze
  squeezed = tf.squeeze(net, axis=[1, 2], name="squeezed")

  if is_training:
    return squeezed, dropout_prob, mode_placeholder
  else:
    return squeezed, mode_placeholder

My training step looks like this:

update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
  optimizer = tf.train.AdamOptimizer(learning_rate=learning_rate_input)
  gvs = optimizer.compute_gradients(cross_entropy_mean)
  capped_gvs = [(tf.clip_by_value(grad, -2., 2.), var) for grad, var in gvs]
  train_step = optimizer.apply_gradients(capped_gvs)

During training, I feed the graph with:

train_summary, train_accuracy, cross_entropy_value, _, _ = sess.run(
    [
        merged_summaries, evaluation_step, cross_entropy_mean, train_step,
        increment_global_step
    ],
    feed_dict={
        fingerprint_input: train_fingerprints,
        ground_truth_input: train_ground_truth,
        learning_rate_input: learning_rate_value,
        dropout_prob: 0.5,
        mode_placeholder: True
    })

During validation:

validation_summary, validation_accuracy, conf_matrix = sess.run(
                [merged_summaries, evaluation_step, confusion_matrix],
                feed_dict={
                    fingerprint_input: validation_fingerprints,
                    ground_truth_input: validation_ground_truth,
                    dropout_prob: 1.0,
                    mode_placeholder: False
                })

My loss and accuracy curves (orange is training, blue is validation): [plots of loss vs. iterations and accuracy vs. iterations]

The validation loss (and accuracy) is very erratic. Is my implementation of batch normalization wrong? Or is this normal with batch normalization and I should just wait for more iterations?


3 Answers


You need to pass is_training to tf.layers.batch_normalization(..., training=is_training); otherwise it will try to normalize the inference mini-batches using the mini-batch statistics instead of the training statistics, which is wrong.
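
A minimal sketch of what that means in TF 1.x (the tensor and placeholder names here are illustrative, not taken from the question):

import tensorflow as tf

x = tf.placeholder(tf.float32, [None, 64], name='x')   # hypothetical input
is_training = tf.placeholder(tf.bool, name='is_training')

# training=True  -> normalize with mini-batch statistics and update the
#                   moving averages; use this during training.
# training=False -> normalize with the stored moving averages; use this
#                   during validation/inference.
x_norm = tf.layers.batch_normalization(x, training=is_training)

# sess.run(..., feed_dict={is_training: True})   # training step
# sess.run(..., feed_dict={is_training: False})  # validation/inference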

Answered 2017-12-31T00:18:37.640

There are mainly two things to check.

1. Are you sure that you are using batch normalization (BN) correctly in your train op?

If you read the layer documentation:

Note: when training, the moving_mean and moving_variance need to be updated. By default the update ops are placed in tf.GraphKeys.UPDATE_OPS, so they need to be added as a dependency to the train_op. Also, be sure to add any batch_normalization ops before getting the update_ops collection. Otherwise, update_ops will be empty, and training/inference will not work properly.

For example:

x_norm = tf.layers.batch_normalization(x, training=training)

# ...
update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS)
with tf.control_dependencies(update_ops):
  train_op = optimizer.minimize(loss)

2. Otherwise, try lowering the momentum in BN.

In fact, during training BN keeps two moving averages of the mean and variance, which are meant to approximate the population statistics. The mean and variance are initialized to 0 and 1 respectively, and then, step by step, each moving average is multiplied by the momentum value (0.99 by default) and the new batch value times 0.01 is added to it. At inference (test) time, the normalization uses these statistics. For this reason, it takes these values a little while to converge to the "true" mean and variance of the data.
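
As a sketch, lowering the momentum could look like this (the 0.9 value is only an example; the layer name mirrors the question's code):

net = tf.layers.batch_normalization(net,
                                    training=mode_placeholder,
                                    momentum=0.9,   # default is 0.99
                                    name='bn_0')

# Conceptually, each training step then updates the stored statistics as:
#   moving_mean = momentum * moving_mean + (1 - momentum) * batch_mean
#   moving_var  = momentum * moving_var  + (1 - momentum) * batch_var
# so a smaller momentum makes them track the data statistics faster.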

Resources:

https://www.tensorflow.org/api_docs/python/tf/layers/batch_normalization

https://github.com/keras-team/keras/issues/7265

https://github.com/keras-team/keras/issues/3366

The original BN paper can be found here:

https://arxiv.org/abs/1502.03167

Answered 2018-11-16T09:37:48.217

I also observed fluctuations in the validation loss when adding batch norm before the ReLU. We found that moving the batch norm to after the ReLU resolved the issue.
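
As a sketch of that reordering, adapted from the question's conv block (not the answerer's exact code):

net = tf.layers.conv2d(inputs=net,
                       filters=64,              # illustrative filter count
                       kernel_size=[5, 5],
                       padding='same',
                       name='conv')
net = tf.nn.relu(net)                           # activation first
net = tf.layers.batch_normalization(net,
                                    training=mode_placeholder,
                                    name='bn_after_relu')  # then batch norm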

Answered 2020-06-01T21:03:44.440