I am fine-tuning the 355M GPT-2 model with aitextgen, using its train function. The dataset is a small txt file made up of lines like the following (these are texts encoded for keyword-based text generation, hence the "~^keywords~@" prefix):
<|startoftext|>~^~@"Yes, but one forgets that she is there--or anywhere. She seems as if she were an accident."<|endoftext|>
<|startoftext|>~^man~@"Then jump out and unharness this horse. A man will come for it to- morrow."<|endoftext|>
<|startoftext|>~^mind 's~@"It would upset the house terribly," said Nan; "but I don't mind that. I'm with you, Patty. Let's do it."<|endoftext|>
<|startoftext|>~^Booth sure say wish~@"I wish I were sure that I had," said Booth.<|endoftext|>
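For context, each line is produced by a small helper roughly like this (the function name is hypothetical; the `~^`/`~@` delimiters and start/end tokens are the ones shown above):

```python
START, END = "<|startoftext|>", "<|endoftext|>"

def encode_line(keywords, text):
    """Wrap keywords and text in the control format used in the dataset.
    `keywords` is a list of strings and may be empty."""
    return f"{START}~^{' '.join(keywords)}~@{text}{END}"

# encode_line(["man"], '"Then jump out and unharness this horse."')
```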
I call aitextgen's train function like this:
gpt2 = aitextgen(tf_gpt2="355M", to_gpu=True)
gpt2.train(dataset,
           line_by_line=True,
           batch_size=1,
           num_steps=50,
           save_every=10,
           generate_every=10,
           learning_rate=1e-3,
           fp16=False)
When I run this, I get the following output:
0%| | 0/10000 [00:00<?, ?it/s]
Windows does not support multi-GPU training. Setting to 1 GPU.
C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\pytorch_lightning\trainer\connectors\callback_connector.py:147: LightningDeprecationWarning: Setting `Trainer(checkpoint_callback=False)` is deprecated in v1.5 and will be removed in v1.7. Please consider using `Trainer(enable_checkpointing=False)`.
rank_zero_deprecation(
C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\pytorch_lightning\trainer\connectors\callback_connector.py:90: LightningDeprecationWarning: Setting `Trainer(progress_bar_refresh_rate=20)` is deprecated in v1.5 and will be removed in v1.7. Please pass `pytorch_lightning.callbacks.progress.TQDMProgressBar` with `refresh_rate` directly to the Trainer's `callbacks` argument instead. Or, to disable the progress bar pass `enable_progress_bar = False` to the Trainer.
rank_zero_deprecation(
C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\pytorch_lightning\trainer\connectors\callback_connector.py:167: LightningDeprecationWarning: Setting `Trainer(weights_summary=None)` is deprecated in v1.5 and will be removed in v1.7. Please set `Trainer(enable_model_summary=False)` instead.
rank_zero_deprecation(
GPU available: True, used: True
TPU available: False, using: 0 TPU cores
IPU available: False, using: 0 IPUs
LOCAL_RANK: 0 - CUDA_VISIBLE_DEVICES: [0]
0%| | 0/50 [00:00<?, ?it/s]
Traceback (most recent call last):
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\transformers\modeling_utils.py", line 1364, in from_pretrained
state_dict = torch.load(resolved_archive_file, map_location="cpu")
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\torch\serialization.py", line 607, in load
return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\torch\serialization.py", line 882, in _load
result = unpickler.load()
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\torch\serialization.py", line 857, in persistent_load
load_tensor(data_type, size, key, _maybe_decode_ascii(location))
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\torch\serialization.py", line 845, in load_tensor
storage = zip_file.get_storage_from_record(name, size, dtype).storage()
RuntimeError: [enforce fail at ..\c10\core\CPUAllocator.cpp:76] data. DefaultCPUAllocator: not enough memory: you tried to allocate 205852672 bytes.
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\multiprocessing\spawn.py", line 116, in spawn_main
exitcode = _main(fd, parent_sentinel)
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\multiprocessing\spawn.py", line 125, in _main
prepare(preparation_data)
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\multiprocessing\spawn.py", line 236, in prepare
_fixup_main_from_path(data['init_main_from_path'])
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\multiprocessing\spawn.py", line 287, in _fixup_main_from_path
main_content = runpy.run_path(main_path,
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\runpy.py", line 265, in run_path
return _run_module_code(code, init_globals, run_name,
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\runpy.py", line 97, in _run_module_code
_run_code(code, mod_globals, init_globals,
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\runpy.py", line 87, in _run_code
exec(code, run_globals)
File "C:\Users\Josh\Python Projects\FYP\src\[py file name].py", line 34, in <module>
gpt2 = aitextgen(tf_gpt2 = "355M", to_gpu= True)
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\aitextgen\aitextgen.py", line 166, in __init__
self.model = GPT2LMHeadModel.from_pretrained(model, config=config)
File "C:\Users\Josh\anaconda3\envs\gpt2_env\lib\site-packages\transformers\modeling_utils.py", line 1368, in from_pretrained
if f.read().startswith("version"):
MemoryError
I have tried many things, including clearing the CUDA cache with torch.cuda.empty_cache() and splitting the file into smaller files. None of them worked.
I am running this on my local machine (RTX 3070, 32 GB RAM), and I checked Task Manager: RAM usage barely reaches 50%. Is there something wrong with my code that is causing the memory error?
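One thing I notice in the traceback is that the failure happens inside multiprocessing's spawn machinery, which re-imports my script and hits the `gpt2 = aitextgen(...)` line again in a child process. To check my understanding of that mechanism, here is a minimal stdlib-only sketch (no aitextgen involved; `expensive_setup` is just a stand-in for loading the model) of why the usual `if __name__ == "__main__":` guard keeps module-level work out of spawned workers:

```python
import multiprocessing as mp

def expensive_setup():
    """Stand-in for aitextgen(tf_gpt2="355M", to_gpu=True) -- the call
    the traceback shows being re-executed in the spawned child."""
    return "model"

def worker(q):
    # The child re-imports this module, but the guarded block below
    # does not run there, so the "model" is only loaded once.
    q.put("worker ran without re-loading the model")

if __name__ == "__main__":
    mp.set_start_method("spawn", force=True)  # same start method Windows uses
    model = expensive_setup()                 # happens once, in the parent
    q = mp.Queue()
    p = mp.Process(target=worker, args=(q,))
    p.start()
    print(q.get())
    p.join()
```

Is this spawn re-import the mechanism behind the MemoryError here, i.e. would restructuring my script around such a guard be the right fix?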