I am using Google Colab to train a GPT-2 model. I told it to save checkpoints to a folder on my Google Drive:

gpt2.finetune(sess,
              dataset=file_name,
              model_name=model_name,
              steps=80000,
              #restore_from='run3',
              run_name='bce_1',
              checkpoint_dir='/content/drive/MyDrive/checkpoints/',
              print_every=100,
              sample_every=200,
              save_every=1000,
              )

So my assumption was that when Colab interrupted the session, I would be able to load the checkpoint and generate some text from it:

sess2 = gpt2.start_tf_sess()
gpt2.load_gpt2(sess=sess2, run_name='bce_1', checkpoint_dir='/content/drive/MyDrive/checkpoints/')

But I get an exception saying that the actual model data file does not exist:

Loading checkpoint /content/drive/MyDrive/checkpoints/bce_1/model-66000
INFO:tensorflow:Restoring parameters from /content/drive/MyDrive/checkpoints/bce_1/model-66000

---------------------------------------------------------------------------

NotFoundError                             Traceback (most recent call last)

/tensorflow-1.15.2/python3.7/tensorflow_core/python/client/session.py in _do_call(self, fn, *args)
   1364     try:
-> 1365       return fn(*args)
   1366     except errors.OpError as e:

12 frames

NotFoundError: 2 root error(s) found.
  (0) Not found: /content/drive/MyDrive/checkpoints/bce_1/model-66000.data-00000-of-00001; No such file or directory
     [[{{node save/RestoreV2}}]]
     [[save/RestoreV2/_133]]
  (1) Not found: /content/drive/MyDrive/checkpoints/bce_1/model-66000.data-00000-of-00001; No such file or directory
     [[{{node save/RestoreV2}}]]
0 successful operations.
0 derived errors ignored.

It really does not exist in the run folder, and it is not in the Drive trash either:

total 5264
-rw------- 1 root root     161 Jul 22 00:23 checkpoint
-rw------- 1 root root       6 Jul 21 06:50 counter
-rw------- 1 root root 1042301 Jul 21 00:43 encoder.json
-rw------- 1 root root      90 Jul 21 00:43 hparams.json
-rw------- 1 root root    5215 Jul 22 00:21 model-66000.index
-rw------- 1 root root 3883813 Jul 22 00:21 model-66000.meta
-rw------- 1 root root  456318 Jul 21 00:43 vocab.bpe
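
The same check can be done programmatically before calling `load_gpt2`. The following is a minimal sketch, not part of gpt-2-simple: the helper name `find_incomplete_checkpoints` is hypothetical, and it assumes (based on the error message above) that a restorable checkpoint `model-N` needs both a `model-N.index` file and at least one `model-N.data-*` shard in the run folder.

```python
import os
import re
import tempfile


def find_incomplete_checkpoints(run_dir):
    """Return {step: [missing file patterns]} for checkpoints missing files.

    Hypothetical helper. Assumes a checkpoint "model-N" is restorable only
    if "model-N.index" and at least one "model-N.data-*" shard exist.
    """
    found = {}  # step -> set of file kinds seen ("index", "meta", "data")
    for name in os.listdir(run_dir):
        m = re.match(r"model-(\d+)\.(index|meta|data-\d+-of-\d+)$", name)
        if m:
            kind = "data" if m.group(2).startswith("data") else m.group(2)
            found.setdefault(int(m.group(1)), set()).add(kind)

    incomplete = {}
    for step, kinds in sorted(found.items()):
        missing = []
        if "index" not in kinds:
            missing.append(f"model-{step}.index")
        if "data" not in kinds:
            missing.append(f"model-{step}.data-*")
        if missing:
            incomplete[step] = missing
    return incomplete


if __name__ == "__main__":
    # Recreate the file names from the listing above in a temporary folder.
    demo_dir = tempfile.mkdtemp()
    for name in ["checkpoint", "counter", "encoder.json", "hparams.json",
                 "model-66000.index", "model-66000.meta", "vocab.bpe"]:
        open(os.path.join(demo_dir, name), "w").close()
    print(find_incomplete_checkpoints(demo_dir))
    # prints {66000: ['model-66000.data-*']}
```

On Colab, the function would be pointed at the run folder instead, for example `find_incomplete_checkpoints('/content/drive/MyDrive/checkpoints/bce_1')`.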

What can I do to avoid losing my training progress when Colab interrupts the session?
