python - In Keras, using SGD, why does model.fit() train smoothly while a step-by-step training method gives exploding gradients and loss?

Question

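The exploding gradients and loss only appear when the network is large, so I won't post the whole network here. I have tried my best, though: over the past two weeks I dug into the details of the source code to monitor some weights, and hand-wrote the update steps so that I could compare the loss, weights, updates, gradients and hyperparameters against Keras' internal state. I think I did my homework before asking here.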

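The problem is that there are two ways to train with the Keras API. The first is model.fit(); the second is a more customized approach meant for more complex training and networks. Although I keep almost everything the same, model.fit() shows no exploding loss while the custom method explodes. Interestingly, when I monitor many details on a much smaller network, the two methods look identical.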

Environment:

# tensorflow 1.14
import numpy as np  # used by the custom training loop for batch sampling
import tensorflow as tf
from tensorflow.keras import backend as K

With the model.fit() method:

# The details of the next two lines are omitted because I can't share them. x is [10000, 32, 32, 3] image data, y is the [10000, 10, 1] labels, and model is a regular Keras model.

x_train, y_train, x_test, y_test = get_data()
model = get_keras_model()

loss_fn = tf.keras.losses.CategoricalCrossentropy()
sgd = tf.keras.optimizers.SGD(lr=.1, momentum=0.9, nesterov=True)

model.compile(loss=loss_fn, optimizer=sgd, metrics=['accuracy'])
history = model.fit(x_train, y_train, batch_size=128, epochs=100, validation_data=(x_test, y_test))

With the custom method:

x_train, y_train, x_test, y_test = get_data()
model = get_keras_model()

input = model.inputs[0]
y_true = tf.placeholder(dtype = tf.int32, shape = [None, 10])
y_pred = model.outputs[0]

loss_fn = tf.keras.losses.CategoricalCrossentropy()
loss = loss_fn(y_true, y_pred)
weights = model.trainable_weights
sgd = tf.keras.optimizers.SGD(lr=.1, momentum=0.9, nesterov=True)

training_updates = sgd.get_updates(loss, weights)
training_fn = K.function([y_true, input], [loss], training_updates)

num_train = 10000
steps_per_epoch = int(num_train / 128) # batch size 128
total_steps = steps_per_epoch * 100 # epoch 100

for step in range(total_steps):
    idx = np.random.randint(0, 10000, 128)
    input_img = x_train[idx]
    ground_true = y_train[idx]

    cur_loss = training_fn([ground_true, input_img])

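In short: same model, same loss function, same SGD optimizer, same image feed (I do control the feed order, even though the code here samples randomly from the training data). Is there anything inside model.fit() that prevents the loss or gradients from exploding?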
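On the feed order: model.fit() with its default shuffle=True visits every training sample once per epoch in a freshly shuffled order and keeps the final partial batch, whereas a loop that calls np.random.randint draws indices with replacement. A minimal sketch of epoch-wise shuffled batching, using only the standard library, might look like the following. The helper name epoch_batches and the seed argument are hypothetical, and this is not Keras' own implementation.

```python
import math
import random


def epoch_batches(num_samples, batch_size, seed=None):
    """Yield index lists that cover every sample exactly once, in shuffled order.

    Hypothetical helper for illustration; not part of Keras.
    """
    rng = random.Random(seed)
    order = list(range(num_samples))
    rng.shuffle(order)
    for start in range(0, num_samples, batch_size):
        # The last slice may be shorter than batch_size.
        yield order[start:start + batch_size]


batches = list(epoch_batches(10000, 128, seed=0))

# Steps per epoch when the final partial batch is kept.
steps_with_partial = math.ceil(10000 / 128)
# Steps per epoch when the remainder is dropped, as int(num_train / 128) does.
steps_truncated = int(10000 / 128)
```

Each index list can then be used to slice x_train and y_train in place of the randomly drawn idx.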

score 0 · Accepted Answer

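After digging deep into the source code, I found the cause of the exploding gradients. The corrected code, with minimal changes, is as follows: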

x_train, y_train, x_test, y_test = get_data()
model = get_keras_model()

input = model.inputs[0]
y_true = tf.placeholder(dtype = tf.int32, shape = [None, 10])
y_pred = model.outputs[0]

loss_fn = tf.keras.losses.CategoricalCrossentropy()
loss = loss_fn(y_true, y_pred)
weights = model.trainable_weights
sgd = tf.keras.optimizers.SGD(lr=.1, momentum=0.9, nesterov=True)

training_updates = sgd.get_updates(loss, weights)

# Correct:
training_fn = K.function([y_true, input, K.symbolic_learning_phase()], [loss], training_updates)

# Before:
# training_fn = K.function([y_true, input], [loss], training_updates)

num_train = 10000
steps_per_epoch = int(num_train / 128) # batch size 128
total_steps = steps_per_epoch * 100 # epoch 100

for step in range(total_steps):
    idx = np.random.randint(0, 10000, 128)
    input_img = x_train[idx]
    ground_true = y_train[idx]

    # Correct:
    cur_loss = training_fn([ground_true, input_img, True])

    # Before:
    # cur_loss = training_fn([ground_true, input_img])

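My understanding of this particular tensor, K.symbolic_learning_phase(), is that it has a default value of False (check the source code where it is initialized), and that layers such as BatchNormalization and Dropout behave differently in the training phase and the test phase. In this case the BatchNormalization layer was the cause of the exploding gradients (it now makes sense that some posts mention getting exploding gradients with BatchNormalization). Its two trainable weights, batch_normalization_1/gamma:0 and batch_normalization_1/beta:0, depend on this tensor; with the default value False they did not learn, and they became nan very quickly during training.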
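To illustrate why the learning phase matters so much for batch normalization, here is a minimal hand-written sketch for a single feature, in plain Python. It is not Keras' implementation: the class name TinyBatchNorm is hypothetical, and only the default values (momentum 0.99, epsilon 1e-3, moving mean 0, moving variance 1) mirror the Keras layer's defaults. With training=False the layer normalizes with its moving statistics, which are still at their initial values, so large activations pass through almost unscaled. With training=True it uses the statistics of the current batch.

```python
import math


class TinyBatchNorm:
    """Single-feature batch normalization, written out by hand (illustrative only)."""

    def __init__(self, momentum=0.99, eps=1e-3):
        self.gamma, self.beta = 1.0, 0.0               # trainable scale and shift
        self.moving_mean, self.moving_var = 0.0, 1.0   # initial moving statistics
        self.momentum, self.eps = momentum, eps

    def __call__(self, xs, training):
        if training:
            # Training phase: normalize with the statistics of this batch
            # and update the moving averages.
            mean = sum(xs) / len(xs)
            var = sum((x - mean) ** 2 for x in xs) / len(xs)
            m = self.momentum
            self.moving_mean = m * self.moving_mean + (1 - m) * mean
            self.moving_var = m * self.moving_var + (1 - m) * var
        else:
            # Test phase: normalize with the stored moving statistics.
            mean, var = self.moving_mean, self.moving_var
        scale = self.gamma / math.sqrt(var + self.eps)
        return [scale * (x - mean) + self.beta for x in xs]


batch = [50.0, 60.0, 70.0, 80.0]
bn = TinyBatchNorm()
out_infer = bn(batch, training=False)  # uses moving mean 0 and variance 1
out_train = bn(batch, training=True)   # uses the batch mean and variance
```

In this sketch out_infer stays close to the raw inputs while out_train is centered around zero with roughly unit spread. In a deep network, unnormalized activations like the first case could plausibly compound from layer to layer, which fits the blow-up described above.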

I noticed that not much Keras code using this training_updates approach actually passes K.symbolic_learning_phase() in, yet this is what the Keras API does behind the scenes.
