python - TensorFlow Object Detection API for a custom dataset - killed during training

Question

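I am training object detection on a custom dataset (storefront images) for a single class (285 images in total), running locally on a CPU with 8 GB of RAM. The process is killed after a few steps.

I am following this blog as a reference.

Here is the console log: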

(tensorflow) rajaram@rajaram-Lenovo-ideapad-110-15ISK:~/tensorflow/models$ python object_detection/train.py \
>     --logtostderr \
>     --pipeline_config_path=/home/rajaram/tensorflow/models/object_detection/models/sf_od_model/ssd_mobilenet_v1_sf_train.config \
>     --train_dir=/home/rajaram/tensorflow/models/object_detection/models/sf_od_model/train
INFO:tensorflow:Summary name Learning Rate is illegal; using Learning_Rate instead.
WARNING:tensorflow:From /home/rajaram/tensorflow/models/object_detection/meta_architectures/ssd_meta_arch.py:607: all_variables (from tensorflow.python.ops.variables) is deprecated and will be removed after 2017-03-02.
Instructions for updating:
Please use tf.global_variables instead.
INFO:tensorflow:Summary name /clone_loss is illegal; using clone_loss instead.
2017-09-26 22:15:08.121785: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 22:15:08.122313: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 22:15:08.123308: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 22:15:08.124144: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 22:15:08.124658: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-09-26 22:15:08.953929: I tensorflow/core/common_runtime/simple_placer.cc:697] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Restoring parameters from /home/rajaram/tensorflow/models/object_detection/models/sf_od_model/ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path /home/rajaram/tensorflow/models/object_detection/models/sf_od_model/train/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Saving checkpoint to path /home/rajaram/tensorflow/models/object_detection/models/sf_od_model/train/model.ckpt
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global_step/sec: 0.00238991
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:global step 1: loss = 14.4365 (801.196 sec/step)
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 1.
INFO:tensorflow:global step 2: loss = 12.9940 (173.981 sec/step)
INFO:tensorflow:Recording summary at step 2.
INFO:tensorflow:Recording summary at step 3.
INFO:tensorflow:global step 3: loss = 12.4866 (166.656 sec/step)
INFO:tensorflow:Saving checkpoint to path /home/rajaram/tensorflow/models/object_detection/models/sf_od_model/train/model.ckpt
INFO:tensorflow:Recording summary at step 3.
INFO:tensorflow:Saving checkpoint to path /home/rajaram/tensorflow/models/object_detection/models/sf_od_model/train/model.ckpt
INFO:tensorflow:global step 4: loss = 11.2386 (162.260 sec/step)
INFO:tensorflow:Recording summary at step 4.
INFO:tensorflow:Recording summary at step 4.
INFO:tensorflow:Recording summary at step 5.
INFO:tensorflow:global step 5: loss = 10.8210 (416.903 sec/step)
INFO:tensorflow:Recording summary at step 5.
Killed
(tensorflow) rajaram@rajaram-Lenovo-ideapad-110-15ISK:~/tensorflow/models$
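
To make the per-step timings in the log above easier to compare, the `global step` lines can be pulled out with a short script. This is only a sketch: it assumes the log has been saved as text and that the line format stays exactly as shown above, and `parse_steps` is a hypothetical helper name, not part of TensorFlow.

```python
import re

# Matches lines such as:
# INFO:tensorflow:global step 2: loss = 12.9940 (173.981 sec/step)
STEP_RE = re.compile(r"global step (\d+): loss = ([\d.]+) \(([\d.]+) sec/step\)")


def parse_steps(log_text):
    """Return a list of (step, loss, seconds_per_step) tuples found in the log."""
    rows = []
    for match in STEP_RE.finditer(log_text):
        step, loss, seconds = match.groups()
        rows.append((int(step), float(loss), float(seconds)))
    return rows


if __name__ == "__main__":
    # Two lines copied from the log above, used as sample input.
    sample = (
        "INFO:tensorflow:global step 1: loss = 14.4365 (801.196 sec/step)\n"
        "INFO:tensorflow:global step 2: loss = 12.9940 (173.981 sec/step)\n"
    )
    for step, loss, seconds in parse_steps(sample):
        print(f"step {step}: loss={loss:.4f}, {seconds:.1f} s/step")
```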

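My thoughts and questions

1) Is the image size a problem? My images are distributed as follows: <= 400x300 (5%), between 400x300 and 640x480 (22%), between 640x480 and 800x600 (63%), and > 800x600 (22%). Although images of about 400x300 are enough to recognize the shop boards, my dataset leans toward higher resolutions because the next step is text recognition on those boards.

Is this line of thinking right?
Should I resize the images to a smaller size (if so, what size is appropriate) and redo the annotations before restarting the whole process?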
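
If resizing turns out to be the way to go, a uniform scale would let the existing boxes be rescaled by the same factor instead of being redrawn. A minimal sketch, assuming pixel-coordinate boxes in (xmin, ymin, xmax, ymax) order and a 640x480 limit chosen only for illustration; `fit_within` and `scale_box` are hypothetical helpers, not part of the Object Detection API.

```python
def fit_within(width, height, max_width=640, max_height=480):
    """Return (new_width, new_height, scale) that fits inside the limit.

    The aspect ratio is preserved and images are never enlarged.
    The 640x480 default is an assumption for illustration only.
    """
    scale = min(max_width / width, max_height / height, 1.0)
    new_width = max(1, round(width * scale))
    new_height = max(1, round(height * scale))
    return new_width, new_height, scale


def scale_box(box, scale):
    """Scale a pixel-coordinate box (xmin, ymin, xmax, ymax) by the same factor."""
    return tuple(round(value * scale) for value in box)


if __name__ == "__main__":
    new_w, new_h, scale = fit_within(800, 600)
    print(new_w, new_h, scale)
    print(scale_box((100, 50, 700, 550), scale))
```

The resize itself would then be done with an image library of choice, using the returned width and height.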

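I was able to train on the Oxford-IIIT Pet data (about 7.9k images, taking about 13 hours) for 2000 steps (num_steps = 2000 in the train_config section of the config file) without a crash or the process being killed. So I assumed that just 285 images should be able to run on the CPU as well.

2) Is swap memory the problem? I also checked other similar posts (one suggesting an increase in swap space, with no follow-up, and another also suggesting more swap memory), but since I can train on the Oxford-IIIT Pet dataset with my current system setup, training on as few as 285 images should not get the process killed.

Is my thinking correct?
If not, and this really is the solution, then I need pointers and clear steps to do it.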
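
Before changing swap, it may help to see how much RAM and swap the machine actually has while training runs. A rough sketch, assuming a Linux system where `/proc/meminfo` reports values in kB; `parse_meminfo` and `summarize` are hypothetical helper names.

```python
import os


def parse_meminfo(text):
    """Parse /proc/meminfo-style text into a dict of {field: value in kB}."""
    info = {}
    for line in text.splitlines():
        name, _, rest = line.partition(":")
        parts = rest.split()
        if parts and parts[0].isdigit():
            info[name.strip()] = int(parts[0])
    return info


def summarize(info):
    """Return total/available RAM and total/free swap, in GiB."""
    kib_per_gib = 1024 * 1024
    fields = ("MemTotal", "MemAvailable", "SwapTotal", "SwapFree")
    return {field: round(info.get(field, 0) / kib_per_gib, 2) for field in fields}


if __name__ == "__main__":
    path = "/proc/meminfo"  # Linux only; skipped on other systems
    if os.path.exists(path):
        with open(path) as handle:
            print(summarize(parse_meminfo(handle.read())))
```

Running this in a second terminal during training would show whether available memory and free swap drop toward zero before the process is killed.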

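I would like to know what is going wrong and get this running locally. I hope I have provided enough information to get help. If not, please let me know what else is needed.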

                         ---------------------------

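System information

What is the top-level directory of the model you are using: tensorflow/models (not yet updated to the new folder structure)
Have I written custom code (as opposed to using a stock example script provided in TensorFlow): Yes - minimal changes (following Dat Tran's template on GitHub for my own dataset)
OS platform and distribution (e.g., Linux Ubuntu 16.04): Ubuntu 16.04.3 LTS
TensorFlow installed from (source or binary): binary (into a virtual environment)
TensorFlow version (use command below): 1.3.0
Bazel version (if compiling from source): NA
CUDA/cuDNN version: NA
GPU model and memory: NA
Exact command to reproduce: python object_detection/train.py --logtostderr --pipeline_config_path=/home/rajaram/tensorflow/models/object_detection/models/sf_od_model/ssd_mobilenet_v1_sf_train.config --train_dir=/home/rajaram/tensorflow/models/object_detection/models/sf_od_model/train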
