
I'm currently trying to learn Python, and at the same time machine learning with GPT-2 language modelling. I ran into some problems, overcame most of them, and in the end got some decent runs.

But... as most of you probably know, training your model takes a lot of CPU/GPU power and time. I can spare the time, but the problem is that I can't let it run uninterrupted on my home computer (yes, I know I could rent a GPU at Google), because I want to be able to do other things while my model is training.

So I have the following questions:

  • Can I somehow stop and restart the training of my model? I read a bit about checkpoints, but the information I found on the topic seemed outdated, so I couldn't figure it out. (I've sketched the pattern I have in mind right after this list.)
  • Can I feed my model incrementally, e.g. 10% of my dataset, let it finish, and then feed it another 10% next week, and so on? If so, how?
  • Bonus question... would it be better to aim for many epochs on a smaller dataset, or a larger dataset with more epochs? And what is a good number of epochs?
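
Regarding the first question, this is roughly the checkpoint pattern I think I've pieced together so far, shown on a tiny toy Keras model rather than my GPT-2 code (the model, data and file name below are made up by me, just to illustrate the stop/resume idea). Is this the right direction, and does it carry over to the training code further down?

import numpy as np
import tensorflow as tf

# toy model and random data, only to illustrate the stop/resume pattern
model = tf.keras.Sequential([tf.keras.layers.Dense(1, input_shape=(4,))])
model.compile(optimizer="adam", loss="mse")
x, y = np.random.rand(64, 4), np.random.rand(64, 1)

# write the weights to disk after every epoch
ckpt = tf.keras.callbacks.ModelCheckpoint("toy_weights.h5", save_weights_only=True)
model.fit(x, y, epochs=3, callbacks=[ckpt])

# ...later, e.g. after a reboot: rebuild the same architecture, load the weights, continue
model.load_weights("toy_weights.h5")
model.fit(x, y, epochs=3)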

Packages:

  • Python 3.7.9
  • tensorflow-gpu 2.3.0
  • tensorflow-estimator 2.3.0
  • transformers 4.2.2
  • tokenizers 0.9.4
  • cudatoolkit 10.1

Code - tokenizer

from tokenizers.models import BPE
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

class BPE_token(object):
    def __init__(self):
        self.tokenizer = Tokenizer(BPE())
        self.tokenizer.normalizer = Sequence([
            NFKC()
        ])
        self.tokenizer.pre_tokenizer = ByteLevel()
        self.tokenizer.decoder = ByteLevelDecoder()

    def bpe_train(self, paths):
        # train a byte-level BPE vocabulary with the special tokens GPT-2 expects
        trainer = BpeTrainer(
            vocab_size=50000,
            show_progress=True,
            initial_alphabet=ByteLevel.alphabet(),
            special_tokens=[
                "<s>",
                "<pad>",
                "</s>",
                "<unk>",
                "<mask>"
            ]
        )
        self.tokenizer.train(trainer, paths)

    def save_tokenizer(self, location, prefix=None):
        # writes vocab.json and merges.txt so they can be loaded again later
        if not os.path.exists(location):
            os.makedirs(location)
        self.tokenizer.model.save(location, prefix)

# ////////// TOKENIZE DATA ////////////
from pathlib import Path
import os

# the folder './da_corpus/' contains all of the training text files
paths = [str(x) for x in Path("./da_corpus/").glob("**/*.txt")]

# train the tokenizer model
tokenizer = BPE_token()
tokenizer.bpe_train(paths)

# save the trained tokenizer files in our specified folder
save_path = 'tokenized_data'
tokenizer.save_tokenizer(save_path)
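
Before kicking off a long training run I round-trip a sample string through the saved files, just to check that they load; this check is my own addition (not from the tutorial I followed), and the test sentence is arbitrary (the corpus in ./da_corpus/ appears to be Danish, but any string works):

from transformers import GPT2Tokenizer

# load the saved vocab.json + merges.txt back through transformers
check_tok = GPT2Tokenizer.from_pretrained(save_path)
ids = check_tok.encode("dette er bare en lille test")  # arbitrary test sentence
print(ids)
print(check_tok.decode(ids))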

Code - model trainer

import os
from pathlib import Path

import tensorflow as tf
from transformers import (GPT2Config, GPT2Tokenizer, TFGPT2LMHeadModel,
                          WEIGHTS_NAME, CONFIG_NAME)

save_path = 'tokenized_data'
tokenizer = GPT2Tokenizer.from_pretrained(save_path)
paths = [str(x) for x in Path("./da_corpus/").glob("**/*.txt")]
# tokenizer = Tokenizer.from_file("./tokenized_data/tokenizer-wiki.json")

# register the same special tokens the BPE trainer used
tokenizer.add_special_tokens({
  "eos_token": "</s>",
  "bos_token": "<s>",
  "unk_token": "<unk>",
  "pad_token": "<pad>",
  "mask_token": "<mask>"
})

# creating the configuration from which the model can be built
config = GPT2Config(
  vocab_size=tokenizer.vocab_size,
  bos_token_id=tokenizer.bos_token_id,
  eos_token_id=tokenizer.eos_token_id
)

# creating the model
model = TFGPT2LMHeadModel(config)

single_string = ''
for filename in paths:
    with open(filename, "r", encoding='utf-8') as f:
        x = f.read()
    single_string += x + tokenizer.eos_token
string_tokenized = tokenizer.encode(single_string)
# print(string_tokenized)



examples = []
block_size = 100
BATCH_SIZE = 12
BUFFER_SIZE = 2000

# cut the token stream into fixed-length blocks
for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])

# shift each block by one position: the model learns to predict the next token
inputs, labels = [], []
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])

dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)

# defining our optimizer, loss and metric
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

# compiling the model
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])

num_epoch = 20
history = model.fit(dataset, epochs=num_epoch)


output_dir = './model_bn_custom/'

if not os.path.exists(output_dir):
    os.mkdir(output_dir)


model_to_save = model.module if hasattr(model, 'module') else model
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)

# save model and model configs
model.save_pretrained(output_dir)
model_to_save.config.to_json_file(output_config_file)

# save tokenizer
tokenizer.save_pretrained(output_dir)
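
And this is what I imagine resuming (question 1) and partial feeding (question 2) could look like, starting from the files saved above: reload the model and tokenizer, recompile, and continue fit() on a slice of the (rebuilt) dataset. I haven't verified this, the variable names are mine, and as far as I understand save_pretrained() only stores the weights and config, not the optimizer state, so that part would start fresh. Is this the right approach?

import tensorflow as tf
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel

output_dir = './model_bn_custom/'

# restore what save_pretrained() wrote earlier (weights + config, no optimizer state)
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = TFGPT2LMHeadModel.from_pretrained(output_dir)

# the model has to be compiled again before fit() can continue
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])

# question 2: feed only part of the data, e.g. roughly 10% of the batches
# (assumes the `dataset` pipeline from above has been rebuilt in this session)
num_batches = int(dataset.cardinality().numpy())
partial_dataset = dataset.take(num_batches // 10)
history = model.fit(partial_dataset, epochs=5)

# overwrite the saved model with the new state
model.save_pretrained(output_dir)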