I am currently trying to learn Python, and at the same time machine learning with GPT-2 language modelling. I ran into some problems, overcame most of them, and in the end got some decent runs.
But... as most of you probably know, training your model takes a lot of CPU/GPU power and time. I can spare the time, but the problem is that I cannot let it run uninterrupted on my home computer (yes, I know I could rent a GPU at Google), because I want to be able to do other things while my model is training.
So I have the following questions:
- Can I somehow stop and restart the training of my model? I have read a bit about checkpoints, but the information I found on the topic seemed outdated, so I could not figure it out. (A minimal checkpoint sketch follows this list.)
- Can I feed my model incrementally, e.g. 10% of my dataset, let it finish, then feed it another 10% next week, and so on? If so, how? (See the resume sketch after the trainer code below.)
- Bonus question: is it better to aim for many epochs on a smaller dataset, or fewer epochs on a bigger dataset? And what is a good number of epochs?
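From what I could piece together so far, a Keras checkpoint callback might be the way to handle the stop/restart part. Below is a minimal sketch of what I imagine it would look like with my stack (TF 2.3); ckpt_dir is just a name I made up, and model, dataset and num_epoch refer to the objects built in the trainer code further down. Please correct me if this is not the right approach:
import os
import tensorflow as tf

# folder and filename pattern for the checkpoints (assumed names, not from my script)
ckpt_dir = './training_checkpoints'
ckpt_path = os.path.join(ckpt_dir, 'ckpt_{epoch:02d}.ckpt')

# save only the weights after every epoch; the GPT2Config can rebuild the model itself
ckpt_cb = tf.keras.callbacks.ModelCheckpoint(
    filepath=ckpt_path,
    save_weights_only=True,
    save_freq='epoch')

# first run:
# model.fit(dataset, epochs=num_epoch, callbacks=[ckpt_cb])

# later run: rebuild the model from the same config, restore the newest weights, continue
# latest = tf.train.latest_checkpoint(ckpt_dir)
# if latest:
#     model.load_weights(latest)
# model.fit(dataset, epochs=num_epoch, callbacks=[ckpt_cb])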
Packages:
- Python 3.7.9
- tensorflow-gpu 2.3.0
- tensorflow-estimator 2.3.0
- transformers 4.2.2
- tokenizers 0.9.4
- cudatoolkit 10.1
Code - Tokenizer
import os

from tokenizers.models import BPE
from tokenizers import Tokenizer
from tokenizers.decoders import ByteLevel as ByteLevelDecoder
from tokenizers.normalizers import NFKC, Sequence
from tokenizers.pre_tokenizers import ByteLevel
from tokenizers.trainers import BpeTrainer

class BPE_token(object):
    def __init__(self):
        self.tokenizer = Tokenizer(BPE())
        self.tokenizer.normalizer = Sequence([NFKC()])
        self.tokenizer.pre_tokenizer = ByteLevel()
        self.tokenizer.decoder = ByteLevelDecoder()

    def bpe_train(self, paths):
        trainer = BpeTrainer(
            vocab_size=50000,
            show_progress=True,
            initial_alphabet=ByteLevel.alphabet(),
            special_tokens=["<s>", "<pad>", "</s>", "<unk>", "<mask>"],
        )
        self.tokenizer.train(trainer, paths)

    def save_tokenizer(self, location, prefix=None):
        if not os.path.exists(location):
            os.makedirs(location)
        self.tokenizer.model.save(location, prefix)
# ////////// TOKENIZE DATA ////////////
from pathlib import Path
import os  # the folder './da_corpus/' contains all the training .txt files

paths = [str(x) for x in Path("./da_corpus/").glob("**/*.txt")]

# train the tokenizer model on the corpus files
tokenizer = BPE_token()
tokenizer.bpe_train(paths)

# saving the trained tokenizer files (vocab + merges) in our specified folder
save_path = 'tokenized_data'
tokenizer.save_tokenizer(save_path)
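Just as a quick sanity check (this is not part of the training script itself), the saved vocab.json/merges.txt can be loaded back through the transformers GPT2Tokenizer to see that a sample sentence round-trips; the Danish test sentence is just something I made up:
from transformers import GPT2Tokenizer

check_tok = GPT2Tokenizer.from_pretrained('tokenized_data')
ids = check_tok.encode("Dette er en lille test.")
print(ids)                    # token ids produced by the trained BPE vocab
print(check_tok.decode(ids))  # should give the original sentence back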
Code - Model trainer
import os
from pathlib import Path

import tensorflow as tf
from transformers import (CONFIG_NAME, WEIGHTS_NAME, GPT2Config,
                          GPT2Tokenizer, TFGPT2LMHeadModel)

# load the trained BPE vocab/merges as a GPT2Tokenizer
save_path = 'tokenized_data'
tokenizer = GPT2Tokenizer.from_pretrained(save_path)
paths = [str(x) for x in Path("./da_corpus/").glob("**/*.txt")]
# tokenizer = Tokenizer.from_file("./tokenized_data/tokenizer-wiki.json")
tokenizer.add_special_tokens({
    "eos_token": "</s>",
    "bos_token": "<s>",
    "unk_token": "<unk>",
    "pad_token": "<pad>",
    "mask_token": "<mask>"
})

# creating the configuration from which the model can be made
config = GPT2Config(
    vocab_size=tokenizer.vocab_size,
    bos_token_id=tokenizer.bos_token_id,
    eos_token_id=tokenizer.eos_token_id
)

# creating the model
model = TFGPT2LMHeadModel(config)
single_string = ''
# concatenate all corpus files into one string, separated by the EOS token
for filename in paths:
    with open(filename, "r", encoding='utf-8') as f:
        x = f.read()
    single_string += x + tokenizer.eos_token
string_tokenized = tokenizer.encode(single_string)
# print(string_tokenized)
examples = []
block_size = 100
BATCH_SIZE = 12
BUFFER_SIZE = 2000
# slice the token stream into fixed-length blocks
for i in range(0, len(string_tokenized) - block_size + 1, block_size):
    examples.append(string_tokenized[i:i + block_size])

# shift by one token so the model learns to predict the next token
inputs, labels = [], []
for ex in examples:
    inputs.append(ex[:-1])
    labels.append(ex[1:])
dataset = tf.data.Dataset.from_tensor_slices((inputs, labels))
dataset = dataset.shuffle(BUFFER_SIZE).batch(BATCH_SIZE, drop_remainder=True)
# defining our optimizer
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
metric = tf.keras.metrics.SparseCategoricalAccuracy('accuracy')

# compiling the model; the extra None losses make Keras ignore the model's
# additional (past key/value) outputs and apply the loss to the logits only
model.compile(optimizer=optimizer, loss=[loss, *[None] * model.config.n_layer], metrics=[metric])
num_epoch = 20
history = model.fit(dataset, epochs=num_epoch)
output_dir = './model_bn_custom/'
if not os.path.exists(output_dir):
    os.mkdir(output_dir)
model_to_save = model.module if hasattr(model, 'module') else model
output_model_file = os.path.join(output_dir, WEIGHTS_NAME)
output_config_file = os.path.join(output_dir, CONFIG_NAME)
# save model and model configs
model.save_pretrained(output_dir)
model_to_save.config.to_json_file(output_config_file)
# save tokenizer
tokenizer.save_pretrained(output_dir)
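And for the incremental-feeding part, this is what I imagine a later session would look like: since save_pretrained() writes the model and tokenizer above, they can be reloaded and trained further on the next chunk of the corpus. This is only a minimal sketch under my assumptions; next_dataset would be built exactly like dataset above, just from the next 10% of the .txt files, and as far as I understand the Adam optimizer state is not restored this way:
from transformers import GPT2Tokenizer, TFGPT2LMHeadModel
import tensorflow as tf

output_dir = './model_bn_custom/'

# reload what the training run above saved with save_pretrained()
tokenizer = GPT2Tokenizer.from_pretrained(output_dir)
model = TFGPT2LMHeadModel.from_pretrained(output_dir)

# recompile with the same settings as above (fresh optimizer state)
optimizer = tf.keras.optimizers.Adam(learning_rate=3e-5, epsilon=1e-08, clipnorm=1.0)
loss = tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True)
model.compile(optimizer=optimizer,
              loss=[loss, *[None] * model.config.n_layer],
              metrics=[tf.keras.metrics.SparseCategoricalAccuracy('accuracy')])

# model.fit(next_dataset, epochs=5)     # continue training on the next slice of data
# model.save_pretrained(output_dir)     # overwrite the saved model for the next round
# tokenizer.save_pretrained(output_dir)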