I am training my own object detection model with the Google Object Detection API. Everything starts out fine and training runs normally, like this:
2017-10-24 17:40:50.579603: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties:
name: GeForce GTX 1050 Ti
major: 6 minor: 1 memoryClockRate (GHz) 1.392
pciBusID 0000:01:00.0
Total memory: 3.94GiB
Free memory: 3.55GiB
2017-10-24 17:40:50.579617: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0
2017-10-24 17:40:50.579621: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0: Y
2017-10-24 17:40:50.579627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: GeForce GTX 1050 Ti, pci bus id: 0000:01:00.0)
2017-10-24 17:40:51.234252: I tensorflow/core/common_runtime/simple_placer.cc:675] Ignoring device specification /device:GPU:0 for node 'prefetch_queue_Dequeue' because the input edge from 'prefetch_queue' is a reference connection and already has a device field set to /device:CPU:0
INFO:tensorflow:Restoring parameters from ssd_mobilenet_v1_coco_11_06_2017/model.ckpt
INFO:tensorflow:Starting Session.
INFO:tensorflow:Saving checkpoint to path training/model/model.ckpt
INFO:tensorflow:Starting Queues.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:Recording summary at step 0.
INFO:tensorflow:global step 1: loss = 14.9167 (3.799 sec/step)
INFO:tensorflow:global step 2: loss = 12.3885 (1.003 sec/step)
INFO:tensorflow:global step 3: loss = 11.5575 (0.825 sec/step)
2017-10-24 17:41:00.695594: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:247] PoolAllocator: After 7141 get requests, put_count=7131 evicted_count=1000 eviction_rate=0.140233 and unsatisfied allocation rate=0.15544
2017-10-24 17:41:00.695684: I tensorflow/core/common_runtime/gpu/pool_allocator.cc:259] Raising pool_size_limit_ from 100 to 110
INFO:tensorflow:global step 4: loss = 10.8721 (0.772 sec/step)
INFO:tensorflow:global step 5: loss = 10.2290 (0.790 sec/step)
INFO:tensorflow:global step 6: loss = 9.5224 (0.799 sec/step)
INFO:tensorflow:global step 7: loss = 9.3629 (0.797 sec/step)
INFO:tensorflow:global step 8: loss = 9.1755 (0.847 sec/step)
INFO:tensorflow:global step 9: loss = 8.3156 (0.788 sec/step)
INFO:tensorflow:global step 10: loss = 8.2479 (0.817 sec/step)
INFO:tensorflow:global step 11: loss = 7.8164 (0.762 sec/step)
INFO:tensorflow:global step 12: loss = 7.5391 (0.769 sec/step)
INFO:tensorflow:global step 13: loss = 6.9219 (0.790 sec/step)
INFO:tensorflow:global step 14: loss = 6.9487 (0.781 sec/step)
INFO:tensorflow:global step 15: loss = 6.6061 (0.793 sec/step)
INFO:tensorflow:global step 16: loss = 6.3786 (0.813 sec/step)
INFO:tensorflow:global step 17: loss = 6.1362 (0.757 sec/step)
INFO:tensorflow:global step 18: loss = 6.1345 (0.766 sec/step)
INFO:tensorflow:global step 19: loss = 6.3627 (0.754 sec/step)
INFO:tensorflow:global step 20: loss = 6.1240 (0.775 sec/step)
INFO:tensorflow:global step 21: loss = 6.0264 (0.750 sec/step)
INFO:tensorflow:global step 22: loss = 5.6904 (0.747 sec/step)
INFO:tensorflow:global step 23: loss = 4.7453 (0.751 sec/step)
INFO:tensorflow:global step 24: loss = 4.7063 (0.766 sec/step)
INFO:tensorflow:global step 25: loss = 5.0677 (0.828 sec/step)
But after some steps, an OOM error occurs:
INFO:tensorflow:global step 5611: loss = 1.2254 (0.780 sec/step)
INFO:tensorflow:global step 5612: loss = 0.8521 (0.755 sec/step)
INFO:tensorflow:global step 5613: loss = 1.5406 (0.786 sec/step)
INFO:tensorflow:global step 5614: loss = 1.3886 (0.748 sec/step)
INFO:tensorflow:global step 5615: loss = 1.2802 (0.740 sec/step)
INFO:tensorflow:global step 5616: loss = 0.9879 (0.755 sec/step)
INFO:tensorflow:global step 5617: loss = 0.9560 (0.774 sec/step)
INFO:tensorflow:global step 5618: loss = 1.0467 (0.755 sec/step)
INFO:tensorflow:global step 5619: loss = 1.2808 (0.763 sec/step)
INFO:tensorflow:global step 5620: loss = 1.3788 (0.753 sec/step)
INFO:tensorflow:global step 5621: loss = 1.1395 (0.727 sec/step)
INFO:tensorflow:global step 5622: loss = 1.2390 (0.751 sec/step)
2017-10-24 18:53:05.076122: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 4.00MiB. Current allocation summary follows.
2017-10-24 18:53:05.076191: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256): Total Chunks: 2, Chunks in use: 0 512B allocated for chunks. 8B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076214: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076245: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024): Total Chunks: 1, Chunks in use: 0 1.0KiB allocated for chunks. 4B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076276: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048): Total Chunks: 4, Chunks in use: 0 8.0KiB allocated for chunks. 5.6KiB client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076299: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-10-24 18:53:05.076324: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192): Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-req
I found that this might be due to multi-GPU training. The relevant traceback is:
Caused by op 'Loss/ToInt32_60', defined at:
File "train.py", line 205, in <module>
tf.app.run()
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
_sys.exit(main(_sys.argv[:1] + flags_passthrough))
File "train.py", line 201, in main
worker_job_name, is_chief, FLAGS.train_dir)
File "/home/yuxin/Project/my_object_detection/object_detection/trainer.py", line 192, in train
clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
File "/home/yuxin/Project/my_object_detection/slim/deployment/model_deploy.py", line 193, in create_clones
outputs = model_fn(*args, **kwargs)
File "/home/yuxin/Project/my_object_detection/object_detection/trainer.py", line 133, in _create_losses
losses_dict = detection_model.loss(prediction_dict)
File "/home/yuxin/Project/my_object_detection/object_detection/meta_architectures/ssd_meta_arch.py", line 431, in loss
location_losses, cls_losses, prediction_dict, match_list)
File "/home/yuxin/Project/my_object_detection/object_detection/meta_architectures/ssd_meta_arch.py", line 565, in _apply_hard_mining
match_list=match_list)
File "/home/yuxin/Project/my_object_detection/object_detection/core/losses.py", line 479, in __call__
self._min_negatives_per_image)
File "/home/yuxin/Project/my_object_detection/object_detection/core/losses.py", line 541, in _subsample_selection_to_desired_neg_pos_ratio
num_positives = tf.reduce_sum(tf.to_int32(positives_indicator))
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 770, in to_int32
return cast(x, dtypes.int32, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/math_ops.py", line 689, in cast
return gen_math_ops.cast(x, base_type, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/gen_math_ops.py", line 403, in cast
result = _op_def_lib.apply_op("Cast", x=x, DstT=DstT, name=name)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/op_def_library.py", line 767, in apply_op
op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
original_op=self._default_original_op, op_def=op_def)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
self._traceback = _extract_stack()
ResourceExhaustedError (see above for traceback): OOM when allocating tensor with shape[1917]
[[Node: Loss/ToInt32_60 = Cast[DstT=DT_INT32, SrcT=DT_BOOL, _device="/job:localhost/replica:0/task:0/gpu:0"](Loss/Gather_220/_8451)]]
Traceback (most recent call last):
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1139, in _do_call
return fn(*args)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
status, run_metadata)
File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
next(self.gen)
File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor with shape[1917]
[[Node: Loss/ToInt32_60 = Cast[DstT=DT_INT32, SrcT=DT_BOOL, _device="/job:localhost/replica:0/task:0/gpu:0"](Loss/Gather_220/_8451)]]
I train with the Object Detection API code shown below, and I only want to train on a single GPU (see the sketch after the snippet for how I try to enforce that).
with tf.Graph().as_default():
  # Build a configuration specifying multi-GPU and multi-replicas.
  deploy_config = model_deploy.DeploymentConfig(
      num_clones=num_clones,
      clone_on_cpu=clone_on_cpu,
      replica_id=task,
      num_replicas=worker_replicas,
      num_ps_tasks=ps_tasks,
      worker_job_name=worker_job_name)

  # Place the global step on the device storing the variables.
  with tf.device(deploy_config.variables_device()):
    global_step = slim.create_global_step()

  with tf.device(deploy_config.inputs_device()):
    input_queue = _create_input_queue(train_config.batch_size // num_clones,
                                      create_tensor_dict_fn,
                                      train_config.batch_queue_capacity,
                                      train_config.num_batch_queue_threads,
                                      train_config.prefetch_queue_capacity,
                                      data_augmentation_options)

  # Gather initial summaries.
  summaries = set(tf.get_collection(tf.GraphKeys.SUMMARIES))
  global_summaries = set([])

  model_fn = functools.partial(_create_losses,
                               create_model_fn=create_model_fn)
  clones = model_deploy.create_clones(deploy_config, model_fn, [input_queue])
  first_clone_scope = clones[0].scope

  # Gather update_ops from the first clone. These contain, for example,
  # the updates for the batch_norm variables created by model_fn.
  update_ops = tf.get_collection(tf.GraphKeys.UPDATE_OPS, first_clone_scope)
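Since this machine has only one GPU, I expect only a single clone to be built as long as num_clones stays at 1. As an extra precaution, this is a minimal sketch of how I try to restrict TensorFlow to one device and make it allocate memory lazily; the exact place where session_config would be passed into slim.learning.train inside trainer.py is an assumption on my part:

# Minimal sketch: expose only one physical GPU to TensorFlow and let the
# allocator grow on demand instead of reserving a fixed pool up front.
# CUDA_VISIBLE_DEVICES must be set before TensorFlow initializes the GPU.
import os
os.environ['CUDA_VISIBLE_DEVICES'] = '0'  # only GPU 0 is visible

import tensorflow as tf

session_config = tf.ConfigProto(allow_soft_placement=True)
session_config.gpu_options.allow_growth = True  # grow allocations lazily

# Assumption: trainer.py ultimately calls slim.learning.train(...), which
# accepts a session_config argument in TF 1.x, e.g.:
#   slim.learning.train(train_tensor, logdir=train_dir,
#                       session_config=session_config, ...)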
I know that reducing the batch size can fix it, but why does training run fine at the beginning and only hit the OOM error after several thousand steps? Thank you very much.
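To check whether memory usage really varies from step to step (the traceback points into the hard-example-mining part of the SSD loss), I am considering logging the allocator statistics with the TF 1.x contrib memory-stats ops. The device placement and the way the values are fetched in the loop below are my own assumptions, not code from trainer.py:

# Debugging sketch (assumes TF 1.x with tf.contrib available).
# These ops report the BFC allocator's current and peak usage for the
# device they are placed on, so fetching them alongside the train op
# shows whether memory climbs or spikes between steps.
import tensorflow as tf
from tensorflow.contrib import memory_stats

with tf.device('/gpu:0'):
  bytes_in_use = memory_stats.BytesInUse()        # current allocation
  max_bytes_in_use = memory_stats.MaxBytesInUse()  # peak since startup

# Hypothetical usage inside the training loop:
#   _, cur, peak = sess.run([train_op, bytes_in_use, max_bytes_in_use])
#   print('memory: %.1f MiB (peak %.1f MiB)' % (cur / 2**20, peak / 2**20))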