python - TensorFlow issue with softmax

Question

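I have a TensorFlow multi-class classifier that produces nan or inf when using tf.nn.softmax. See the following code snippet (logits has shape batch_size x 6, since I have 6 classes and the output is one-hot encoded). batch_size is 1024.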

logits = tf.debugging.check_numerics(logits, message='bad logits', name=None)
probabilities = tf.nn.softmax(logits=logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

The classifier fails on the last statement because it finds nan or inf in probabilities. logits is clean, otherwise the first statement would have failed.

From what I have read about tf.nn.softmax, it can handle very large and very small values in logits. I have verified this in interactive mode.

>>> with tf.Session() as s:
...   a = tf.constant([[1000, 10], [-100, -200], [3, 4.0]])
...   sm = tf.nn.softmax(logits=a, name='Softmax')
...   print(a.eval())
...   print(sm.eval())
...
[[1000.   10.]
 [-100. -200.]
 [   3.    4.]]
[[1.         0.        ]
 [1.         0.        ]
 [0.26894143 0.7310586 ]]
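For reference, softmax implementations usually get this robustness by subtracting the row maximum before exponentiating. The sketch below is plain Python and only illustrates that trick; the helper names naive_softmax and stable_softmax are made up for this example, and it is an assumption that tf.nn.softmax does something equivalent internally.

```python
import math

def naive_softmax(row):
    # Direct formula exp(v) / sum(exp(v)); math.exp overflows for large v.
    exps = [math.exp(v) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

def stable_softmax(row):
    # Subtracting the row maximum does not change the result mathematically,
    # but every exponent becomes <= 0, so exp() cannot overflow.
    m = max(row)
    exps = [math.exp(v - m) for v in row]
    total = sum(exps)
    return [e / total for e in exps]

# Same rows as in the interactive session above.
for row in ([1000.0, 10.0], [-100.0, -200.0], [3.0, 4.0]):
    print(stable_softmax(row))

try:
    naive_softmax([1000.0, 10.0])
except OverflowError as err:
    print("naive version overflows:", err)
```

With finite logits, the shifted form always has at least one term equal to exp(0) = 1 in the denominator, so the division itself cannot produce nan.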

Then I tried clipping the values in logits, and now the whole thing works. See the modified snippet below.

logits = tf.debugging.check_numerics(logits, message='logits', name=None)
safe_logits = tf.clip_by_value(logits, -15.0, 15.0)
probabilities = tf.nn.softmax(logits=safe_logits, name='Softmax')
probabilities = tf.debugging.check_numerics(probabilities, message='bad probabilities', name=None)

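In the second statement, I clip the values in logits to the range -15 to 15, which somehow prevents nan/inf in the softmax computation. So I was able to fix the issue at hand.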

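However, I still don't understand why this clipping works. (I should mention that clipping between -20 and 20 does not work, and the model still fails with nan or inf in probabilities.)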

Can someone help me understand why this is the case?

I am using tensorflow 1.15.0, running on a 64-bit instance.

score 3 · Accepted Answer

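The first place to look is the values themselves, which you have already done. The second place to look is the gradients. Even if the values look reasonable, a very steep gradient means backpropagation will eventually blow up both the gradients and the values.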

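For example, if the logits are generated by something like log(x), then an x of 0.001 produces -6.9. That looks benign. But the gradient is 1000! This will quickly blow up the gradients and values during backpropagation / forward propagation.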

# Pretend this is the source value that is fed to a function that generates the logit. 
>>> x = tf.Variable(0.001)

# Let's operate on the source value to generate the logit. 
>>> with tf.GradientTape() as tape:
...   y = tf.math.log(x)
... 

# The logit looks okay... -6.9. 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-6.9077554>

# But the gradient is exploding. 
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=999.99994>
>>>
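To make the "blow up" step concrete, here is a plain-Python sketch of one possible mechanism. It assumes a single plain gradient-descent update with a hypothetical learning rate of 0.01 applied directly to x; the original post does not show how the logits are actually produced, so this is an illustration rather than a diagnosis. The helper log_or_nan is made up here to mimic floating-point log semantics.

```python
import math

def log_or_nan(x):
    # Mimics floating-point log: nan for negative input, -inf at zero.
    # (math.log itself raises ValueError in those cases.)
    if x < 0.0:
        return float("nan")
    if x == 0.0:
        return float("-inf")
    return math.log(x)

x = 0.001
learning_rate = 0.01      # hypothetical value, chosen for illustration
grad = 1.0 / x            # d/dx log(x) = 1/x, roughly 1000 at x = 0.001

# One plain gradient-descent step that tries to decrease log(x).
x_next = x - learning_rate * grad

print(grad)                # roughly 1000
print(x_next)              # far below zero: the step overshoots the domain of log
print(log_or_nan(x_next))  # nan
```

A harmless-looking value of -6.9 therefore sits one update away from a nan, which would then propagate into everything computed from it.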

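Clipping the logits looks like it is about feeding smaller values to softmax, but that is probably not why it helps. (In fact, softmax can handle a logit with the value tf.float32.max without trouble, so the values of the logits are unlikely to be the problem.) What is probably really happening is that when you clip to 15, you also set the gradient to zero in the cases where the logit would otherwise have been 20 with an exploding gradient. So clipping the values also introduces clipped gradients.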

# This is same source variable as above. 
>>> x = tf.Variable(0.001)

# Now let's operate with clipping. 
>>> with tf.GradientTape() as tape:
...   y = tf.clip_by_value(tf.math.log(x), -1., 1.)
... 

# The clipped logit still looks okay... 
>>> y
<tf.Tensor: shape=(), dtype=float32, numpy=-1.0>

# What may be more important is that the clipping has also zeroed out the gradient
>>> tape.gradient(y,x)
<tf.Tensor: shape=(), dtype=float32, numpy=0.0>
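The zeroing only happens where the clip is active. The following plain-Python sketch checks this with finite differences instead of TensorFlow autodiff; clipped_log and numeric_grad are hypothetical helpers written for this illustration, using the same -1 to 1 range as the snippet above.

```python
import math

def clipped_log(x, lo=-1.0, hi=1.0):
    # Scalar analogue of tf.clip_by_value(tf.math.log(x), lo, hi).
    return min(max(math.log(x), lo), hi)

def numeric_grad(f, x, rel_eps=1e-6):
    # Central finite difference with a step proportional to x.
    h = x * rel_eps
    return (f(x + h) - f(x - h)) / (2.0 * h)

for x in (0.001, 0.5, 2.0, 100.0):
    print(x, clipped_log(x), numeric_grad(clipped_log, x))
```

For x = 0.001 and x = 100.0 the log falls outside the range, the output is pinned to the boundary, and the estimated gradient is zero. For x = 0.5 and x = 2.0 the log lies inside the range and the gradient is still about 1/x. That is consistent with a tighter clip range (15 rather than 20) cutting off more of the steep-gradient cases.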
