I have trained a custom BPE tokenizer for RoBERTa using the tokenizers library.
I then trained the custom model on the masked LM task using the skeleton provided by run_language_modeling.py.
The model reaches a perplexity of 3.2832 on a held-out evaluation set.
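For reference, the perplexity that run_language_modeling.py reports is just the exponential of the evaluation cross-entropy loss; a minimal sketch with a hypothetical eval_loss value:

import math

eval_loss = 1.189                      # hypothetical stand-in for the script's reported eval loss
perplexity = math.exp(eval_loss)
print(perplexity)                      # ~3.28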
Here is the part that confuses me when decoding the model's predictions:
When using the pretrained model, the following works fine:
>> from transformers import pipeline
>> nlp = pipeline('fill-mask', model='roberta-base')
>> nlp("has the <mask> ever had checkup")
[{'sequence': '<s> has the cat ever had checkup</s>',
'score': 0.11192905157804489,
'token': 4758},
{'sequence': '<s> has the baby ever had checkup</s>',
'score': 0.08717110008001328,
'token': 1928},
{'sequence': '<s> has the dog ever had checkup</s>',
'score': 0.07775705307722092,
'token': 2335},
{'sequence': '<s> has the man ever had checkup</s>',
'score': 0.04057956114411354,
'token': 313},
{'sequence': '<s> has the woman ever had checkup</s>',
'score': 0.031859494745731354,
'token': 693}]
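For context, the decoding step here boils down to: take the logits at the <mask> position, pick the top token IDs, and convert them back to strings with the tokenizer. A simplified sketch of roughly what the fill-mask pipeline does (not the exact pipeline code, and using the stock roberta-base checkpoint):

import torch
from transformers import RobertaForMaskedLM, RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained('roberta-base')
model = RobertaForMaskedLM.from_pretrained('roberta-base')
model.eval()

# <s> and </s> are added automatically around the masked sentence
input_ids = tokenizer.encode("has the <mask> ever had checkup", return_tensors='pt')
mask_pos = (input_ids[0] == tokenizer.mask_token_id).nonzero()[0].item()

with torch.no_grad():
    logits = model(input_ids)[0]       # shape: (1, seq_len, vocab_size)

top_ids = logits[0, mask_pos].topk(5).indices.tolist()
print(tokenizer.convert_ids_to_tokens(top_ids))   # subword strings of the top predictions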
However, when using the custom-trained RoBERTa model with the custom BPE tokenizer:
>> from transformers import pipeline, RobertaForMaskedLM, RobertaTokenizer
>> model = RobertaForMaskedLM.from_pretrained(path_to_model)
>> tokenizer = RobertaTokenizer.from_pretrained(path_to_tokenizer)
>> nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>> nlp("has the <mask> ever had checkup")
[{'sequence': '<s> the had never had checkup</s>',
'score': 0.08322840183973312,
'token': 225},
{'sequence': '<s> the - had never had checkup</s>',
'score': 0.07046554237604141,
'token': 311},
{'sequence': '<s> the o had never had checkup</s>',
'score': 0.020223652943968773,
'token': 293},
{'sequence': '<s> the _ had never had checkup</s>',
'score': 0.013033385388553143,
'token': 1246},
{'sequence': '<s> the r had never had checkup</s>',
'score': 0.011952929198741913,
'token': 346}]
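For what it is worth, the predicted token IDs above can be mapped back to their subword strings with the same custom tokenizer, e.g.:

>> tokenizer.convert_ids_to_tokens([225, 311, 293, 1246, 346])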
My question is: even when my custom model predicts the correct replacement for the <mask> token, it is only a subword token. So how do I get complete words back, the way the pretrained model does? E.g. cat gets split into ca & t, so if you mask cat, my custom model can only predict ca and never the remaining t part.
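To illustrate, the subword split can be checked directly on the custom tokenizer (path_to_tokenizer as above); based on the behaviour described, something like ['ca', 't'] should come back for cat:

>> from transformers import RobertaTokenizer
>> tokenizer = RobertaTokenizer.from_pretrained(path_to_tokenizer)
>> tokenizer.tokenize("cat")                          # expected: ['ca', 't'] with this vocab
>> tokenizer.tokenize("has the cat ever had checkup")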
The following config.json was used for training:
{
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": null,
"do_sample": false,
"eos_token_ids": null,
"finetuning_task": "mlm",
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 768,
"id2label": {
"0": "LABEL_0",
"1": "LABEL_1"
},
"initializer_range": 0.02,
"intermediate_size": 3072,
"is_decoder": false,
"label2id": {
"LABEL_0": 0,
"LABEL_1": 1
},
"layer_norm_eps": 1e-12,
"length_penalty": 1.0,
"max_length": 100,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 12,
"num_beams": 1,
"num_hidden_layers": 6,
"num_labels": 2,
"num_return_sequences": 1,
"output_attentions": false,
"output_hidden_states": false,
"output_past": true,
"pad_token_id": null,
"pruned_heads": {},
"repetition_penalty": 1.0,
"temperature": 1.0,
"top_k": 50,
"top_p": 1.0,
"torchscript": false,
"type_vocab_size": 2,
"use_bfloat16": false,
"vocab_size": 30522
}
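A minimal sketch (assuming the JSON above is saved as config.json) of loading and inspecting it:

from transformers import RobertaConfig

config = RobertaConfig.from_json_file("config.json")   # hypothetical local path
print(config.vocab_size, config.max_position_embeddings, config.num_hidden_layers)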
And the following was used to train the custom tokenizer (some people have mentioned that the order of the special tokens must be kept the same):
from tokenizers import ByteLevelBPETokenizer
# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()
# Customize training
tokenizer.train(files=["/home/data/BPE/txt_data/72148_tokens.txt",
"/home/data/BPE/txt_data/70551_tokens.txt",
"/home/data/BPE/txt_data/70553_tokens.txt",
"/home/data/BPE/txt_data/78452_tokens.txt",
"/home/data/BPE/txt_data/74177_tokens.txt",
"/home/data/BPE/txt_data/71260_tokens.txt",
"/home/data/BPE/txt_data/71250_tokens.txt",], vocab_size=30522, min_frequency=10, special_tokens=[
"<s>",
"<pad>",
"</s>",
"<unk>",
"<mask>",
])
# Save files to disk
tokenizer.save("/home/data/BPE/", "30k_v3_roberta")
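Regarding the special-token order, a quick sanity check on the tokenizer object trained above is to confirm which IDs the special tokens were actually assigned (e.g. <s>=0, <pad>=1, </s>=2, <unk>=3, <mask>=4 if they were added in the order listed); a minimal sketch:

# Print the ID assigned to each special token in the trained vocab
for tok in ["<s>", "<pad>", "</s>", "<unk>", "<mask>"]:
    print(tok, tokenizer.token_to_id(tok))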