tensorflow - Error when using data augmentation options in the Object Detection API

Question

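I'm trying to train a network, specifically ssd_mobilenet_v1, using data_augmentation_options in the .config file, but when I enable the random_adjust_brightness option I very quickly get the error message pasted below (I enabled the option after step 110000).

I tried reducing the default value: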

optional float max_delta=1 [default=0.2];

But the result is the same.

Any idea why? The images are RGB from PNG files (from the Bosch Small Traffic Lights Dataset).

INFO:tensorflow:global step 110011: loss = 22.7990 (0.357 sec/step)
INFO:tensorflow:global step 110012: loss = 47.8811 (0.401 sec/step)
2017-11-16 11:02:29.114785: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
2017-11-16 11:02:29.114895: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
2017-11-16 11:02:29.114969: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
2017-11-16 11:02:29.115043: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: LossTensor is inf or nan. : Tensor had NaN values
     [[Node: CheckNumerics = CheckNumerics[T=DT_FLOAT, message="LossTensor is inf or nan.", _device="/job:localhost/replica:0/task:0/device:CPU:0"](total_loss)]]
2017-11-16 11:02:29.115112: W tensorflow/core/framework/op_kernel.cc:1192] Invalid argument: LossTensor is inf or nan. : Tensor had NaN values
...

Edit: here is the workaround I found. The inf or nan is in the loss, so I checked the brightness randomization function in /object_detection/core/preprocessor.py:

def random_adjust_brightness(image, max_delta=0.2):
  """Randomly adjusts brightness.

  Makes sure the output image is still between 0 and 1.

  Args:
    image: rank 3 float32 tensor contains 1 image -> [height, width, channels]
           with pixel values varying between [0, 1].
    max_delta: how much to change the brightness. A value between [0, 1).

  Returns:
    image: image which is the same shape as input image.
    boxes: boxes which is the same shape as input boxes.
  """
  with tf.name_scope('RandomAdjustBrightness', values=[image]):
    image = tf.image.random_brightness(image, max_delta)
    image = tf.clip_by_value(image, clip_value_min=0.0, clip_value_max=1.0)
    return image

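It assumes that image values must be between 0.0 and 1.0. Is it possible that the images actually arrive zero-centered, or even in a different range? In that case the clipping would corrupt them and cause the failure. Long story short: I commented out the clipping line and it is working (we'll see the results).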
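To make the hypothesis above concrete, here is a minimal sketch in plain Python (no TensorFlow) of what clamping to [0, 1] does to values in different ranges. The helper `clip01` and the sample pixel values are hypothetical and only mimic the effect of the `tf.clip_by_value` call. The sketch does not show which range the images actually have inside the training pipeline.

```python
def clip01(values):
    """Clamp every value into [0.0, 1.0], like the clip call in the function above."""
    return [min(1.0, max(0.0, v)) for v in values]

# Hypothetical pixel values, not taken from the dataset.
unit_range = [0.0, 0.25, 0.5, 1.0]       # already in [0, 1]: clipping changes nothing
zero_centered = [-1.0, -0.5, 0.5, 1.0]   # e.g. scaled to [-1, 1]: negatives are lost

print(clip01(unit_range))     # [0.0, 0.25, 0.5, 1.0]
print(clip01(zero_centered))  # [0.0, 0.0, 0.5, 1.0]
```

If the images really were zero-centered at that point, every negative value would collapse to 0, which would explain why removing the clip changes the behavior.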

score 1 · Accepted Answer

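Usually, getting LossTensor is inf or nan. : Tensor had NaN values is caused by errors in the bounding boxes/annotations (source: https://github.com/tensorflow/models/issues/1881).

I know that the Bosch Small Traffic Lights Dataset has some annotations that extend beyond the image dimensions. For example, the images in that dataset are 720 pixels high, but some bounding boxes have a vertical coordinate greater than 720. This is common because whenever the car recording the sequence drives under a traffic light, some of the lights are still visible and some of them are cut off.

I know this isn't an exact answer to your question, but hopefully it gives some insight into a possible cause of your problem. Removing the annotations that extend beyond the image dimensions may help. However, I'm dealing with the same problem even though I'm not using image preprocessing: on the same dataset, I hit the LossTensor is inf or nan. : Tensor had NaN values error roughly every 8000 steps.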
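One way to try the suggestion of removing out-of-image annotations is sketched below. It assumes the annotations have already been loaded into a list of dicts, each with a 'boxes' list whose entries use the keys x_min, x_max, y_min and y_max, and that the images are 1280x720. The function names and the sample entry are hypothetical.

```python
IMAGE_WIDTH = 1280   # assumed image width in pixels
IMAGE_HEIGHT = 720   # image height mentioned above

def box_inside_image(box, width=IMAGE_WIDTH, height=IMAGE_HEIGHT):
    """Return True if every coordinate of the box lies within the image."""
    return (0 <= float(box['x_min']) <= width and
            0 <= float(box['x_max']) <= width and
            0 <= float(box['y_min']) <= height and
            0 <= float(box['y_max']) <= height)

def drop_outside_boxes(examples):
    """Return copies of the examples with out-of-image boxes removed."""
    cleaned = []
    for example in examples:
        kept = [b for b in example['boxes'] if box_inside_image(b)]
        cleaned.append(dict(example, boxes=kept))
    return cleaned

# Hypothetical annotation entry: the second box extends below the image.
sample = [{'path': 'a.png', 'boxes': [
    {'x_min': 10, 'x_max': 30, 'y_min': 100, 'y_max': 140},
    {'x_min': 600, 'x_max': 620, 'y_min': 700, 'y_max': 731},
]}]
print(len(drop_outside_boxes(sample)[0]['boxes']))  # 1
```

Dropping a box entirely loses a partially visible traffic light; clamping the coordinates to the image bounds would be an alternative, at the cost of slightly altered labels.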

score 0 · Accepted Answer

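I ran into this problem too, and ended up writing a quick-and-dirty script to find the culprits. I don't know whether the image set changes over time, but the one I downloaded contained three images with bad annotations: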
./rgb/train/2015-10-05-11-26-32_bag/105870.png

./rgb/train/2015-10-05-11-26-32_bag/108372.png

./rgb/train/2015-10-05-14-40-46_bag/462350.png

For those interested, here is my script:

import yaml

INPUT_YAML = "train.yaml"
IMAGE_WIDTH = 1280   # image size in pixels
IMAGE_HEIGHT = 720

# safe_load needs no Loader argument (newer PyYAML versions require one for yaml.load)
with open(INPUT_YAML, 'rb') as f:
  examples = yaml.safe_load(f)
print("Loaded", len(examples), "examples")

for example in examples:
  for box in example['boxes']:
    xmin = float(box['x_min'])
    xmax = float(box['x_max'])
    ymin = float(box['y_min'])
    ymax = float(box['y_max'])
    # Flag boxes that are inverted or extend past the right/bottom edge
    if xmax < xmin or xmax > IMAGE_WIDTH or xmin > IMAGE_WIDTH:
      print("INVALID IMAGE:", example['path'], "X_MAX =", xmax)
    if ymax < ymin or ymax > IMAGE_HEIGHT or ymin > IMAGE_HEIGHT:
      print("INVALID IMAGE:", example['path'], "Y_MAX =", ymax)

score 0 · Accepted Answer

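Besides annotations that extend beyond the image dimensions, the Bosch traffic light detection training dataset also has one image where x_max < x_min and y_max < y_min, which results in a negative width and height. This causes the "LossTensor is inf or nan. : Tensor had NaN values" error roughly every 8000 steps. I had the same error; removing the offending entry fixed it.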
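As a rough sketch of how such an entry could be filtered out before building the training records: the code below keeps only boxes with strictly positive width and height. It assumes annotation dicts with the keys x_min, x_max, y_min and y_max. The function names and the sample data are hypothetical.

```python
def has_positive_size(box):
    """Return True if the box has strictly positive width and height."""
    width = float(box['x_max']) - float(box['x_min'])
    height = float(box['y_max']) - float(box['y_min'])
    return width > 0 and height > 0

def drop_degenerate_boxes(boxes):
    """Keep only boxes whose max corner is beyond the min corner."""
    return [b for b in boxes if has_positive_size(b)]

# Hypothetical boxes: the second one has its corners swapped.
boxes = [
    {'x_min': 100, 'x_max': 120, 'y_min': 50, 'y_max': 90},
    {'x_min': 300, 'x_max': 280, 'y_min': 200, 'y_max': 170},
]
print(len(drop_degenerate_boxes(boxes)))  # 1
```

Swapping the corners back instead of dropping the box might also work, but only if the annotation is known to be a simple corner swap rather than a labeling error.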
