tensorflow - 在没有任何输出或错误的情况下，Keras 中的 SSD 实施训练在几次迭代后停止

Question

在第一个 epoch 的几次迭代之后，训练过程停止，没有任何输出或错误消息。Keras 中的 SSD 实现来自https://github.com/rykov8/ssd_keras

    base_lr = 3e-4
#optim = keras.optimizers.Adam(lr=base_lr)
optim = keras.optimizers.RMSprop(lr=base_lr)
#optim = keras.optimizers.SGD(lr=base_lr, momentum=0.9, decay=decay, nesterov=True)
model.compile(optimizer=optim,
              loss=MultiboxLoss(NUM_CLASSES+1, neg_pos_ratio=2.0).compute_loss)



nb_epoch = 10
history = model.fit_generator(gen.generate(True), gen.train_batches,
                              nb_epoch, verbose=1,
                              callbacks=None,
                              validation_data=gen.generate(False),
                              nb_val_samples=gen.val_batches,
                              nb_worker=1
                               )

程序的输出如下：

    Epoch 1/10
/home/deepesh/Documents/ssd_traffic/ssd_utils.py:119: RuntimeWarning: divide by zero encountered in log
  assigned_priors_wh)
2017-10-15 18:00:53.763886: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.54GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-15 18:01:02.602807: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.14GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-15 18:01:03.831092: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.17GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-15 18:01:03.831138: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.10GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-15 18:01:04.774444: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.26GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-15 18:01:05.897872: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.46GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-15 18:01:05.897923: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 3.94GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-15 18:01:09.133494: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.27GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-15 18:01:09.133541: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 1.15GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
2017-10-15 18:01:11.266114: W tensorflow/core/common_runtime/bfc_allocator.cc:217] Allocator (GPU_0_bfc) ran out of memory trying to allocate 2.13GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
13/14 [==========================>...] - ETA: 9s - loss: 2.9617

之后没有输出或错误消息。

score 1 · Accepted Answer

你没有足够的内存，你可以做的事情来解决这个问题：

减少批量大小
减少训练数据的大小
在云中训练你的模型（AMS、谷歌云等）
使用另一个具有更多内存的 GPU 卡
或尝试 CPU

tensorflow - 在没有任何输出或错误的情况下，Keras 中的 SSD 实施训练在几次迭代后停止

1 回答 1

Related

Reference