I'm using Google Colab to fine-tune GPT-2. I told it to save checkpoints to a folder on my Google Drive:
gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_name,
              steps=80000,
              #restore_from='run3',
              run_name='bce_1',
              checkpoint_dir='/content/drive/MyDrive/checkpoints/',
              print_every=100,
              sample_every=200,
              save_every=1000,
              )
My assumption was that when Colab interrupts the session, I would be able to load the latest checkpoint and generate some text from it:
sess2 = gpt2.start_tf_sess()
gpt2.load_gpt2(sess=sess2, run_name='bce_1', checkpoint_dir='/content/drive/MyDrive/checkpoints/')
But I get an exception saying that the actual model data file does not exist:
Loading checkpoint /content/drive/MyDrive/checkpoints/bce_1/model-66000
INFO:tensorflow:Restoring parameters from /content/drive/MyDrive/checkpoints/bce_1/model-66000
---------------------------------------------------------------------------
NotFoundError Traceback (most recent call last)
/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
1364 try:
-> 1365 return fn(*args)
1366 except errors.OpError as e:
12 frames
NotFoundError: 2 root error(s) found.
(0) Not found: /content/drive/MyDrive/checkpoints/bce_1/model-66000.data-00000-of-00001; No such file or directory
[[{{node save/RestoreV2}}]]
[[save/RestoreV2/_133]]
(1) Not found: /content/drive/MyDrive/checkpoints/bce_1/model-66000.data-00000-of-00001; No such file or directory
[[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.
The file really isn't there, and it isn't in the Drive trash either:
total 5264
-rw------- 1 root root 161 Jul 22 00:23 checkpoint
-rw------- 1 root root 6 Jul 21 06:50 counter
-rw------- 1 root root 1042301 Jul 21 00:43 encoder.json
-rw------- 1 root root 90 Jul 21 00:43 hparams.json
-rw------- 1 root root 5215 Jul 22 00:21 model-66000.index
-rw------- 1 root root 3883813 Jul 22 00:21 model-66000.meta
-rw------- 1 root root 456318 Jul 21 00:43 vocab.bpe
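A check along these lines would confirm which parts of a checkpoint are missing before trying to load it. It is only a minimal sketch: `missing_checkpoint_parts` is a hypothetical helper, not part of the fine-tuning library, and it assumes the TensorFlow 1.x layout visible in the listing above (`<prefix>.index`, `<prefix>.meta`, and one or more `<prefix>.data-*` shards).

```python
import glob
import os


def missing_checkpoint_parts(prefix):
    """Return the suffixes of a TF1-style checkpoint that are absent on disk.

    `prefix` is the checkpoint path without an extension,
    e.g. '/content/drive/MyDrive/checkpoints/bce_1/model-66000'.
    """
    missing = []
    # The index and graph metadata are single files.
    for suffix in (".index", ".meta"):
        if not os.path.isfile(prefix + suffix):
            missing.append(suffix)
    # The weights live in one or more shards named <prefix>.data-XXXXX-of-YYYYY.
    if not glob.glob(glob.escape(prefix) + ".data-*"):
        missing.append(".data-*")
    return missing


# Path taken from the error message above.
print(missing_checkpoint_parts(
    "/content/drive/MyDrive/checkpoints/bce_1/model-66000"))
```

An empty list means all three kinds of file are present. A list containing `.data-*` would match the situation shown in the listing, where the index and meta files exist but the weights do not.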
What can I do to avoid losing my training progress when Colab interrupts the session?