2017-07-07 14:21:28.793025: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-07 14:21:28.793037: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-07 14:21:28.793040: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2017-07-07 14:21:28.793042: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2017-07-07 14:21:28.793044: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2017-07-07 14:21:28.953864: I tensorflow/core/common_runtime/gpu/gpu_device.cc:940] Found device 0 with properties: 
name: Quadro M2000
major: 5 minor: 2 memoryClockRate (GHz) 1.1625
pciBusID 0000:01:00.0
Total memory: 3.93GiB
Free memory: 30.00MiB
2017-07-07 14:21:28.953885: I tensorflow/core/common_runtime/gpu/gpu_device.cc:961] DMA: 0 
2017-07-07 14:21:28.953890: I tensorflow/core/common_runtime/gpu/gpu_device.cc:971] 0:   Y 
2017-07-07 14:21:28.953896: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Quadro M2000, pci bus id: 0000:01:00.0)
2017-07-07 14:21:28.957332: E tensorflow/stream_executor/cuda/cuda_driver.cc:924] failed to allocate 30.00M (31457280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
2017-07-07 14:21:39.936797: W tensorflow/core/common_runtime/bfc_allocator.cc:273] Allocator (GPU_0_bfc) ran out of memory trying to allocate 60.00MiB.  Current allocation summary follows.
2017-07-07 14:21:39.936839: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (256):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936851: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (512):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936860: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1024):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936869: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2048):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936878: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4096):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936887: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8192):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936895: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16384):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936904: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (32768):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936912: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (65536):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936922: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (131072):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936930: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (262144):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936939: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (524288):    Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936947: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (1048576):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936956: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (2097152):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936965: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (4194304):   Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936976: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (8388608):   Total Chunks: 1, Chunks in use: 0 9.91MiB allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936985: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (16777216):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.936996: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (33554432):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.937004: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (67108864):  Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.937013: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (134217728):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.937022: I tensorflow/core/common_runtime/bfc_allocator.cc:643] Bin (268435456):     Total Chunks: 0, Chunks in use: 0 0B allocated for chunks. 0B client-requested for chunks. 0B in use in bin. 0B client-requested in use in bin.
2017-07-07 14:21:39.937031: I tensorflow/core/common_runtime/bfc_allocator.cc:660] Bin for 60.00MiB was 32.00MiB, Chunk State: 
2017-07-07 14:21:39.937040: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c0000 of size 1280
2017-07-07 14:21:39.937047: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c0500 of size 256
2017-07-07 14:21:39.937053: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c0600 of size 256
2017-07-07 14:21:39.937059: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c0700 of size 512
2017-07-07 14:21:39.937065: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c0900 of size 256
2017-07-07 14:21:39.937071: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c0a00 of size 256
2017-07-07 14:21:39.937076: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c0b00 of size 1024
2017-07-07 14:21:39.937082: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c0f00 of size 256
2017-07-07 14:21:39.937088: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c1000 of size 256
2017-07-07 14:21:39.937094: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c1100 of size 1536
2017-07-07 14:21:39.937099: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c1700 of size 256
2017-07-07 14:21:39.937105: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c1800 of size 256
2017-07-07 14:21:39.937111: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c1900 of size 1536
2017-07-07 14:21:39.937116: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c1f00 of size 256
2017-07-07 14:21:39.937122: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c2000 of size 256
2017-07-07 14:21:39.937127: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c2100 of size 1024
2017-07-07 14:21:39.937133: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c2500 of size 256
2017-07-07 14:21:39.937138: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c2600 of size 256
2017-07-07 14:21:39.937144: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c2700 of size 16384
2017-07-07 14:21:39.937150: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c6700 of size 256
2017-07-07 14:21:39.937155: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c6800 of size 256
2017-07-07 14:21:39.937161: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031c6900 of size 68096
2017-07-07 14:21:39.937167: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d7800 of size 256
2017-07-07 14:21:39.937195: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d7900 of size 256
2017-07-07 14:21:39.937201: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d7a00 of size 256
2017-07-07 14:21:39.937206: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d7b00 of size 256
2017-07-07 14:21:39.937212: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d7c00 of size 256
2017-07-07 14:21:39.937217: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d7d00 of size 256
2017-07-07 14:21:39.937223: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d7e00 of size 256
2017-07-07 14:21:39.937228: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d7f00 of size 256
2017-07-07 14:21:39.937249: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d8000 of size 256
2017-07-07 14:21:39.937253: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d8100 of size 256
2017-07-07 14:21:39.937257: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d8200 of size 256
2017-07-07 14:21:39.937261: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d8300 of size 256
2017-07-07 14:21:39.937265: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d8400 of size 256
2017-07-07 14:21:39.937268: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13031d8500 of size 256
2017-07-07 14:21:39.937272: I tensorflow/core/common_runtime
2017-07-07 14:21:39.937301: I tensorflow/core/common_runtime
2017-07-07 14:21:39.937310: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x13037b3600 of size 5308416
2017-07-07 14:21:39.937314: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1303cc3600 of size 1536
2017-07-07 14:21:39.937318: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1303cc3c00 of size 3538944
2017-07-07 14:21:39.937322: I tensorflow/core/common_runtime/bfc_allocator.cc:678] Chunk at 0x1304023c00 of size 1024
2017-07-07 14:21:39.937327: I tensorflow/core/common_runtime/bfc_allocator.cc:687] Free at 0x1304024000 of size 10390784
2017-07-07 14:21:39.937331: I tensorflow/core/common_runtime/bfc_allocator.cc:693]      Summary of in-use Chunks by size: 
2017-07-07 14:21:39.937337: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 34 Chunks of size 256 totalling 8.5KiB
2017-07-07 14:21:39.937342: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 3 Chunks of size 512 totalling 1.5KiB
2017-07-07 14:21:39.937347: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 1024 totalling 4.0KiB
2017-07-07 14:21:39.937353: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 1280 totalling 1.2KiB
2017-07-07 14:21:39.937357: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 4 Chunks of size 1536 totalling 6.0KiB
2017-07-07 14:21:39.937362: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 16384 totalling 16.0KiB
2017-07-07 14:21:39.937368: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 68096 totalling 66.5KiB
2017-07-07 14:21:39.937373: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 139520 totalling 136.2KiB
2017-07-07 14:21:39.937378: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 2457600 totalling 2.34MiB
2017-07-07 14:21:39.937383: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 2 Chunks of size 3538944 totalling 6.75MiB
2017-07-07 14:21:39.937388: I tensorflow/core/common_runtime/bfc_allocator.cc:696] 1 Chunks of size 5308416 totalling 5.06MiB
2017-07-07 14:21:39.937393: I tensorflow/core/common_runtime/bfc_allocator.cc:700] Sum Total of in-use chunks: 14.39MiB
2017-07-07 14:21:39.937401: I tensorflow/core/common_runtime/bfc_allocator.cc:702] Stats: 
Limit:                    31457280
InUse:                    15089664
MaxInUse:                 15089664
NumAllocs:                      53
MaxAllocSize:              5308416

2017-07-07 14:21:39.937412: W tensorflow/core/common_runtime/bfc_allocator.cc:277] ************************************************************________________________________________
2017-07-07 14:21:39.937433: W tensorflow/core/framework/op_kernel.cc:1148] Resource exhausted: OOM when allocating tensor of shape [3840,4096] and type float
2017-07-07 14:21:39.955389: E tensorflow/core/common_runtime/executor.cc:644] Executor failed to create kernel. Resource exhausted: OOM when allocating tensor of shape [3840,4096] and type float
     [[Node: coarse6/weights/Adam/Initializer/zeros = Const[_class=["loc:@coarse6/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: [3840,4096] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]
Traceback (most recent call last):
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1139, in _do_call
    return fn(*args)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1121, in _run_fn
    status, run_metadata)
  File "/usr/lib/python3.5/contextlib.py", line 66, in __exit__
    next(self.gen)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/errors_impl.py", line 466, in raise_exception_on_not_ok_status
    pywrap_tensorflow.TF_GetCode(status))
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3840,4096] and type float
     [[Node: coarse6/weights/Adam/Initializer/zeros = Const[_class=["loc:@coarse6/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: [3840,4096] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/home/varun/Desktop/Depth_Estimation/task.py", line 145, in <module>
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/varun/Desktop/Depth_Estimation/task.py", line 141, in main

  File "/home/varun/Desktop/Depth_Estimation/task.py", line 58, in train

  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 789, in run
    run_metadata_ptr)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 997, in _run
    feed_dict_string, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1132, in _do_run
    target_list, options, run_metadata)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/client/session.py", line 1152, in _do_call
    raise type(e)(node_def, op, message)
tensorflow.python.framework.errors_impl.ResourceExhaustedError: OOM when allocating tensor of shape [3840,4096] and type float
     [[Node: coarse6/weights/Adam/Initializer/zeros = Const[_class=["loc:@coarse6/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: [3840,4096] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

Caused by op 'coarse6/weights/Adam/Initializer/zeros', defined at:
  File "/home/varun/Desktop/Depth_Estimation/task.py", line 145, in <module>
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/platform/app.py", line 48, in run
    _sys.exit(main(_sys.argv[:1] + flags_passthrough))
  File "/home/varun/Desktop/Depth_Estimation/task.py", line 141, in main
  File "/home/varun/Desktop/Depth_Estimation/task.py", line 37, in train
    train_op = op.train(loss, global_step, BATCH_SIZE)
  File "/home/varun/Desktop/Depth_Estimation/train_operation.py", line 36, in train
    apply_gradient_op = opt.apply_gradients(grads, global_step=global_step)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 446, in apply_gradients
    self._create_slots([_get_variable_for(v) for v in var_list])
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/adam.py", line 128, in _create_slots
    self._zeros_slot(v, "m", self._name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/optimizer.py", line 766, in _zeros_slot
    named_slots[_var_key(var)] = slot_creator.create_zeros_slot(var, op_name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/slot_creator.py", line 174, in create_zeros_slot
    colocate_with_primary=colocate_with_primary)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/slot_creator.py", line 146, in create_slot_with_initializer
    dtype)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/training/slot_creator.py", line 66, in _create_slot_var
    validate_shape=validate_shape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 1065, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 962, in get_variable
    use_resource=use_resource, custom_getter=custom_getter)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 367, in get_variable
    validate_shape=validate_shape, use_resource=use_resource)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 352, in _true_getter
    use_resource=use_resource)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 725, in _get_single_variable
    validate_shape=validate_shape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variables.py", line 200, in __init__
    expected_shape=expected_shape)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variables.py", line 278, in _init_from_args
    initial_value(), name="initial_value", dtype=dtype)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/variable_scope.py", line 701, in <lambda>
    shape.as_list(), dtype=dtype, partition_info=partition_info)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/init_ops.py", line 93, in __call__
    return array_ops.zeros(shape, dtype)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/ops/array_ops.py", line 1383, in zeros
    output = constant(zero, shape=shape, dtype=dtype, name=name)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/constant_op.py", line 106, in constant
    attrs={"value": tensor_value, "dtype": dtype_value}, name=name).outputs[0]
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 2506, in create_op
    original_op=self._default_original_op, op_def=op_def)
  File "/usr/local/lib/python3.5/dist-packages/tensorflow/python/framework/ops.py", line 1269, in __init__
    self._traceback = _extract_stack()

ResourceExhaustedError (see above for traceback): OOM when allocating tensor of shape [3840,4096] and type float
     [[Node: coarse6/weights/Adam/Initializer/zeros = Const[_class=["loc:@coarse6/weights"], dtype=DT_FLOAT, value=Tensor<type: float shape: [3840,4096] values: [0 0 0]...>, _device="/job:localhost/replica:0/task:0/gpu:0"]()]]

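Training fails to start because of a resource allocation error. I have checked all the posts about this problem. How can I resolve it? I also tried the BFC allocator GPU solution posted in TFFRCNN; I have added it to my training code below as a comment. I also checked the issue raised by Yaroslav Bulatov here: https://github.com/CharlesShang/TFFRCNN/issues/68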
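The device log above reports `Free memory: 30.00MiB` out of `Total memory: 3.93GiB`, so it may be worth checking what else is holding GPU memory before the session starts. A minimal sketch of such a check, assuming `nvidia-smi` is on the PATH; the helper names `parse_gpu_memory` and `query_gpu_memory` are hypothetical and not part of the linked repository:

```python
import subprocess


def parse_gpu_memory(csv_text):
    """Parse 'used, free, total' rows (MiB) from nvidia-smi CSV output.

    Expects output produced with --format=csv,noheader,nounits,
    one line per GPU.
    """
    rows = []
    for line in csv_text.strip().splitlines():
        used, free, total = (int(field.strip()) for field in line.split(","))
        rows.append({"used_mib": used, "free_mib": free, "total_mib": total})
    return rows


def query_gpu_memory():
    """Run nvidia-smi and return per-GPU memory usage (hypothetical helper)."""
    output = subprocess.check_output(
        [
            "nvidia-smi",
            "--query-gpu=memory.used,memory.free,memory.total",
            "--format=csv,noheader,nounits",
        ],
        universal_newlines=True,
    )
    return parse_gpu_memory(output)


# Usage on a machine with an NVIDIA driver installed:
#     for gpu in query_gpu_memory():
#         print(gpu)
```

If the free figure is already close to zero before `task.py` runs, the memory is probably held by another process rather than by this training script.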

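If any changes are needed, can anyone help me modify the code? I also tried reducing the batch size and tried running it on a GPU training server. I could not fix it.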
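For reference, the failing tensor is the Adam slot for `coarse6/weights` with shape `[3840, 4096]`, which matches the weight shape and so does not change with the batch size. A quick arithmetic check against the allocator limit from the log, assuming "type float" means 4-byte float32:

```python
# Shape taken from the error message: [3840, 4096], type float.
shape = (3840, 4096)
bytes_per_float32 = 4  # assumption: "float" here is a 32-bit float

tensor_bytes = shape[0] * shape[1] * bytes_per_float32
allocator_limit_bytes = 31457280  # "Limit" value from the allocator stats in the log

MIB = 1024 * 1024
print("tensor size:     %.2f MiB" % (tensor_bytes / MIB))
print("allocator limit: %.2f MiB" % (allocator_limit_bytes / MIB))
print("fits in limit:   %s" % (tensor_bytes <= allocator_limit_bytes))
```

The single tensor needs 60 MiB, which agrees with the "trying to allocate 60.00MiB" line, while the allocator was only given 30 MiB in total.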

I am using the code from https://github.com/MasazI/cnn_depth_tensorflow. Please see the train_operation.py file at that link. I only modified task.py.

My training code:

    from datetime import datetime
    from tensorflow.python.platform import gfile
    import numpy as np
    import tensorflow as tf
    from dataset import DataSet
    from dataset import output_predict
    import model
    import train_operation as op

    MAX_STEPS = 10000000
    LOG_DEVICE_PLACEMENT = False
    BATCH_SIZE = 4
    TRAIN_FILE = "train.csv"
    COARSE_DIR = "coarse"
    REFINE_DIR = "refine"

    REFINE_TRAIN = True
    FINE_TUNE = True

    def train():
        with tf.Graph().as_default():
            global_step = tf.Variable(0, trainable=False)
            dataset = DataSet(BATCH_SIZE)
            images, depths, invalid_depths = dataset.csv_inputs(TRAIN_FILE)
            keep_conv = tf.placeholder(tf.float32)
            keep_hidden = tf.placeholder(tf.float32)
            if REFINE_TRAIN:
                print("refine train.")
                coarse = model.inference(images, keep_conv, trainable=False)
                logits = model.inference_refine(images, coarse, keep_conv, keep_hidden)
            else:
                print("coarse train.")
                logits = model.inference(images, keep_conv, keep_hidden)
            loss = model.loss(logits, depths, invalid_depths)
            train_op = op.train(loss, global_step, BATCH_SIZE)
            init_op = tf.global_variables_initializer()

            # Session
            '''
            BFC Allocator Method
            # Without softplacement creepy errors
            config = tf.ConfigProto(allow_soft_placement=True)
            config.gpu_options.allocator_type = 'BFC'
            config.gpu_options.per_process_gpu_memory_fraction = 0.90
            config.gpu_options.allow_growth = True
            sess = tf.Session(config=config)
            '''

            '''
            # Lengthy log
            gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.8)
            sess = tf.Session(config=tf.ConfigProto(log_device_placement=True, gpu_options=gpu_options))
            '''
            sess = tf.Session(config=tf.ConfigProto(log_device_placement=LOG_DEVICE_PLACEMENT))
            sess.run(init_op)

            # parameters
            coarse_params = {}
            refine_params = {}
            if REFINE_TRAIN:
                # tf.all_variables() is deprecated; tf.global_variables() is the replacement
                for variable in tf.global_variables():
                    variable_name = variable.name.replace(':','__')
                    print("parameter: %s" % (variable_name))
                    if variable_name.find("/") < 0 or variable_name.count("/") != 1:
                        continue
                    if variable_name.find('coarse') >= 0:
                        coarse_params[variable_name] = variable
                    if variable_name.find('fine') >= 0:
                        refine_params[variable_name] = variable
            else:
                for variable in tf.trainable_variables():
                    variable_name = variable.name.replace(':','__')
                    print("parameter: %s" %(variable_name))
                    if variable_name.find("/") < 0 or variable_name.count("/") != 1:
                        continue
                    if variable_name.find('coarse') >= 0:
                        coarse_params[variable_name] = variable
                    if variable_name.find('fine') >= 0:
                        refine_params[variable_name] = variable
            # define saver
            print (coarse_params)
            saver_coarse = tf.train.Saver(coarse_params)
            if REFINE_TRAIN:
                saver_refine = tf.train.Saver(refine_params)
            # fine tune
            if FINE_TUNE:
                coarse_ckpt = tf.train.get_checkpoint_state(COARSE_DIR)
                if coarse_ckpt and coarse_ckpt.model_checkpoint_path:
                    print("Pretrained coarse Model Loading.")
                    saver_coarse.restore(sess, coarse_ckpt.model_checkpoint_path)
                    print("Pretrained coarse Model Restored.")
                else:
                    print("No Pretrained coarse Model.")
                if REFINE_TRAIN:
                    refine_ckpt = tf.train.get_checkpoint_state(REFINE_DIR)
                    if refine_ckpt and refine_ckpt.model_checkpoint_path:
                        print("Pretrained refine Model Loading.")
                        saver_refine.restore(sess, refine_ckpt.model_checkpoint_path)
                        print("Pretrained refine Model Restored.")
                    else:
                        print("No Pretrained refine Model.")

            # train
            coord = tf.train.Coordinator()
            threads = tf.train.start_queue_runners(sess=sess, coord=coord)
            for step in range(MAX_STEPS):
                index = 0
                for i in range(1000):
                    _, loss_value, logits_val, images_val = sess.run([train_op, loss, logits, images], feed_dict={keep_conv: 0.8, keep_hidden: 0.5})
                    if index % 10 == 0:
                        print("%s: %d[epoch]: %d[iteration]: train loss %f" % (datetime.now(), step, index, loss_value))
                        assert not np.isnan(loss_value), 'Model diverged with loss = NaN'
                    if index % 500 == 0:
                        if REFINE_TRAIN:
                            output_predict(logits_val, images_val, "data/predict_refine_%05d_%05d" % (step, i))
                        else:
                            output_predict(logits_val, images_val, "data/predict_%05d_%05d" % (step, i))
                    index += 1

                # save every 5 epochs and on the final epoch (step runs from 0 to MAX_STEPS - 1)
                if step % 5 == 0 or (step + 1) == MAX_STEPS:
                    if REFINE_TRAIN:
                        refine_checkpoint_path = REFINE_DIR + '/model.ckpt'
                        saver_refine.save(sess, refine_checkpoint_path, global_step=step)
                    else:
                        coarse_checkpoint_path = COARSE_DIR + '/model.ckpt'
                        saver_coarse.save(sess, coarse_checkpoint_path, global_step=step)
            coord.request_stop()
            coord.join(threads)
            sess.close()