6

假设我有一个非常小的数据集,只有 50 张图像。我想重用Red Pill教程中的代码,但在每批训练中对同一组图像应用随机变换,比如随机改变亮度、对比度等。我只添加了一个函数:

def preprocessImages(x):
    retValue = numpy.empty_like(x)
    for i in range(50):
        image = x[i]
        image = tf.reshape(image, [28,28,1])
        image = tf.image.random_brightness(image, max_delta=63)
        #image = tf.image.random_contrast(image, lower=0.2, upper=1.8)
        # Subtract off the mean and divide by the variance of the pixels.
        float_image = tf.image.per_image_whitening(image)
        float_image_Mat = sess.run(float_image)
        retValue[i] = float_image_Mat.reshape((28*28))
    return retValue

对旧代码的小改动:

batch = mnist.train.next_batch(50)
for i in range(1000):
  #batch = mnist.train.next_batch(50)
  if i%100 == 0:
    train_accuracy = accuracy.eval(feed_dict={
        x:preprocessImages(batch[0]), y_: batch[1], keep_prob: 1.0})
    print("step %d, training accuracy %g"%(i, train_accuracy))
  train_step.run(feed_dict={x: preprocessImages(batch[0]), y_: batch[1], keep_prob: 0.5})

第一次迭代成功,然后崩溃:

step 0, training accuracy 0.02
W tensorflow/core/common_runtime/executor.cc:1027] 0x117e76c0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_12_grad/Relu_12/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_16)]]
W tensorflow/core/common_runtime/executor.cc:1027] 0x117e76c0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_13_grad/Relu_13/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_17)]]
W tensorflow/core/common_runtime/executor.cc:1027] 0x117e76c0 Compute status: Invalid argument: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_14_grad/Relu_14/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_18)]]
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/media/sf_Data/mnistConv.py", line 69, in <module>
    train_step.run(feed_dict={x: preprocessImages(batch[0]), y_: batch[1], keep_prob: 0.5})
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1267, in run
    _run_using_default_session(self, feed_dict, self.graph, session)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2763, in _run_using_default_session
    session.run(operation, feed_dict)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 345, in run
    results = self._do_run(target_list, unique_fetch_targets, feed_dict_string)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/client/session.py", line 419, in _do_run
    e.code)
tensorflow.python.framework.errors.InvalidArgumentError: ReluGrad input is not finite. : Tensor had NaN values
     [[Node: gradients_4/Relu_12_grad/Relu_12/CheckNumerics = CheckNumerics[T=DT_FLOAT, message="ReluGrad input is not finite.", _device="/job:localhost/replica:0/task:0/cpu:0"](add_16)]]
Caused by op u'gradients_4/Relu_12_grad/Relu_12/CheckNumerics', defined at:
  File "<stdin>", line 1, in <module>
  File "/media/sf_Data/mnistConv.py", line 58, in <module>
    train_step = tf.train.AdamOptimizer(1e-4).minimize(cross_entropy)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 165, in minimize
    gate_gradients=gate_gradients)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/training/optimizer.py", line 205, in compute_gradients
    loss, var_list, gate_gradients=(gate_gradients == Optimizer.GATE_OP))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gradients.py", line 414, in gradients
    in_grads = _AsList(grad_fn(op_wrapper, *out_grads))
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 107, in _ReluGrad
    t = _VerifyTensor(op.inputs[0], op.name, "ReluGrad input is not finite.")
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/nn_grad.py", line 100, in _VerifyTensor
    verify_input = array_ops.check_numerics(t, message=msg)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 48, in check_numerics
    name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 988, in __init__
    self._traceback = _extract_stack()

...which was originally created as op u'Relu_12', defined at:
  File "<stdin>", line 1, in <module>
  File "/media/sf_Data/mnistConv.py", line 34, in <module>
    h_conv1 = tf.nn.relu(conv2d(x_image, W_conv1) + b_conv1)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_nn_ops.py", line 506, in relu
    return _op_def_lib.apply_op("Relu", features=features, name=name)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 633, in apply_op
    op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1710, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 988, in __init__
    self._traceback = _extract_stack()

这与我在包含 50 个训练示例的个人数据集上遇到的错误完全相同。

4

2 回答 2

4

首先要开始的一件事:使用合并运算符,而不是计算 y_conv 和交叉熵tf.softmax_cross_entropy_with_logits。这可能无法解决您的问题,但它比 Red Pill 示例中的幼稚版本在数值上更稳定。

其次,尝试在每次迭代时打印出 cross_entropy。

cross_entropy = .... (previous code here)
cross_entropy = tf.Print(cross_entropy, [cross_entropy], "Cross-entropy: ")

了解它是否会随着模型的进展而趋于无穷大,或者它是否只是跳到 inf 或 NaN。如果它逐渐爆炸,那么它可能是学习率。如果它跳跃,它可能是一个可以如上所述解决的数值边界条件。如果它从一开始就存在,那么您在应用失真的方式上可能会出现错误,最终以某种方式输入严重损坏的数据。

于 2015-12-12T00:09:38.993 回答
-5

按照链接:https ://github.com/tensorflow/tensorflow/issues/323#issuecomment-165855633解决了这个问题

于 2015-12-21T06:33:18.690 回答