
I am working on automatic segmentation. I was training a model over the weekend when the power went out. I had trained my model for over 50 hours, saving it every 5 epochs with the following line:

model_checkpoint = ModelCheckpoint('test_{epoch:04}.h5', monitor=observe_var, mode='auto', save_weights_only=False, save_best_only=False, period=5)

I am loading the saved model with the following line:

model = load_model('test_{epoch:04}.h5', custom_objects = {'dice_coef_loss': dice_coef_loss, 'dice_coef': dice_coef})

I have included all my data, with the training data split into train_x (scans) and train_y (labels). When I run the line:

loss, dice_coef = model.evaluate(train_x,  train_y, verbose=1)

I get the error:

ResourceExhaustedError:  OOM when allocating tensor with shape[32,8,128,128,128] and type float on /job:localhost/replica:0/task:0/device:GPU:0 by allocator GPU_0_bfc
 [[node model/conv3d_1/Conv3D (defined at <ipython-input-1-4a66b6c9f26b>:275) ]]
Hint: If you want to see a list of allocated tensors when OOM happens, add report_tensor_allocations_upon_oom to RunOptions for current allocation info.
 [Op:__inference_distributed_function_3673]

Function call stack:
distributed_function

1 Answer


You are running out of GPU memory, so you need to evaluate in smaller batches. The default batch size is 32; try passing a smaller one.

model.evaluate(train_x, train_y, batch_size=<batch size>)

From the Keras documentation:

batch_size: Integer or None. Number of samples per gradient update. If unspecified, batch_size will default to 32.
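To see why a smaller `batch_size` fixes the OOM, note that `evaluate` only ever materializes one batch of activations at a time, then averages the per-batch results. Here is a minimal NumPy sketch of that batching logic (the `evaluate_in_batches` helper and the toy MSE "loss" are illustrative, not part of the Keras API):

```python
import numpy as np

def evaluate_in_batches(evaluate_fn, x, y, batch_size=4):
    # Slice (x, y) into chunks of batch_size, evaluate each chunk,
    # and return the length-weighted average of the per-batch losses.
    # This mirrors what model.evaluate(..., batch_size=...) does:
    # only one batch lives in memory at a time.
    total, n = 0.0, len(x)
    for start in range(0, n, batch_size):
        xb = x[start:start + batch_size]
        yb = y[start:start + batch_size]
        total += evaluate_fn(xb, yb) * len(xb)
    return total / n

# Toy "loss": mean squared error between predictions and labels.
mse = lambda xb, yb: float(np.mean((xb - yb) ** 2))
x = np.arange(10, dtype=float)
y = x + 1.0  # every element differs by exactly 1, so MSE is 1.0
print(evaluate_in_batches(mse, x, y, batch_size=3))  # 1.0
```

With your 3D volumes of shape (8, 128, 128, 128), the default batch of 32 must hold 32 such activations per layer at once, which is what exhausts the GPU allocator; a batch size of 1 or 2 shrinks that peak by the same factor.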

answered 2020-04-28T00:39:51.050