
I have trained a custom BPE tokenizer for RoBERTa using the tokenizers library.

I then trained a custom model on the masked LM task using the skeleton provided by run_language_modeling.py.

The model reaches a perplexity of 3.2832 on a held-out evaluation set.
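If I understand the script correctly, the reported perplexity is just the exponential of the mean evaluation loss. A rough sketch, with the loss value back-calculated from the number above rather than taken from an actual run:

import torch

eval_loss = 1.1889  # illustrative only; the real value comes from the evaluation loop
perplexity = torch.exp(torch.tensor(eval_loss))
print(perplexity)  # ~3.2832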

Here is what confuses me when decoding the model's predictions:

When using the pretrained model, the following works fine:

>> from transformers import pipeline
>> nlp = pipeline('fill-mask', model='roberta-base')
>> nlp("has the <mask> ever had checkup")
[{'sequence': '<s> has the cat ever had checkup</s>',
  'score': 0.11192905157804489,
  'token': 4758},
 {'sequence': '<s> has the baby ever had checkup</s>',
  'score': 0.08717110008001328,
  'token': 1928},
 {'sequence': '<s> has the dog ever had checkup</s>',
  'score': 0.07775705307722092,
  'token': 2335},
 {'sequence': '<s> has the man ever had checkup</s>',
  'score': 0.04057956114411354,
  'token': 313},
 {'sequence': '<s> has the woman ever had checkup</s>',
  'score': 0.031859494745731354,
  'token': 693}]

However, when using the custom-trained RoBERTa model together with the custom BPE tokenizer:

>> from transformers import RobertaForMaskedLM, RobertaTokenizer, pipeline
>> model = RobertaForMaskedLM.from_pretrained(path_to_model)
>> tokenizer = RobertaTokenizer.from_pretrained(path_to_tokenizer)
>> nlp = pipeline('fill-mask', model=model, tokenizer=tokenizer)
>> nlp("has the <mask> ever had checkup")
[{'sequence': '<s> the  had never had checkup</s>',
  'score': 0.08322840183973312,
  'token': 225},
 {'sequence': '<s> the - had never had checkup</s>',
  'score': 0.07046554237604141,
  'token': 311},
 {'sequence': '<s> the o had never had checkup</s>',
  'score': 0.020223652943968773,
  'token': 293},
 {'sequence': '<s> the _ had never had checkup</s>',
  'score': 0.013033385388553143,
  'token': 1246},
 {'sequence': '<s> the r had never had checkup</s>',
  'score': 0.011952929198741913,
  'token': 346}]

My question is: even when my custom model predicts the correct replacement for the <mask> token, it is only a sub-word token. So how do I get whole words back, the way the pretrained model does? E.g. cat gets split into ca & t, so if you mask cat, my custom model only predicts ca and leaves out the remaining t part.
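To make the issue concrete, here is a small check (a sketch, reusing the same path_to_tokenizer as above) of how the custom tokenizer splits the sentence:

from transformers import RobertaTokenizer

tokenizer = RobertaTokenizer.from_pretrained(path_to_tokenizer)

# "cat" comes back as two sub-word pieces (roughly "ca" + "t"), so a single
# <mask> position can only be filled with one piece, never the whole word
print(tokenizer.tokenize("has the cat ever had checkup"))

# sanity-check that <mask> is actually known to the tokenizer
print(tokenizer.mask_token, tokenizer.mask_token_id)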

The following config.json was used for training:

{
  "architectures": [
    "RobertaForMaskedLM"
  ],
  "attention_probs_dropout_prob": 0.1,
  "bos_token_id": null,
  "do_sample": false,
  "eos_token_ids": null,
  "finetuning_task": "mlm",
  "hidden_act": "gelu",
  "hidden_dropout_prob": 0.1,
  "hidden_size": 768,
  "id2label": {
    "0": "LABEL_0",
    "1": "LABEL_1"
  },
  "initializer_range": 0.02,
  "intermediate_size": 3072,
  "is_decoder": false,
  "label2id": {
    "LABEL_0": 0,
    "LABEL_1": 1
  },
  "layer_norm_eps": 1e-12,
  "length_penalty": 1.0,
  "max_length": 100,
  "max_position_embeddings": 514,
  "model_type": "roberta",
  "num_attention_heads": 12,
  "num_beams": 1,
  "num_hidden_layers": 6,
  "num_labels": 2,
  "num_return_sequences": 1,
  "output_attentions": false,
  "output_hidden_states": false,
  "output_past": true,
  "pad_token_id": null,
  "pruned_heads": {},
  "repetition_penalty": 1.0,
  "temperature": 1.0,
  "top_k": 50,
  "top_p": 1.0,
  "torchscript": false,
  "type_vocab_size": 2,
  "use_bfloat16": false,
  "vocab_size": 30522
}
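For reference, a quick way to double-check that the vocab_size in this config lines up with the trained BPE vocabulary, as a sketch (the path is illustrative):

from transformers import RobertaConfig

config = RobertaConfig.from_json_file("config.json")  # illustrative path

# the 30522 here has to match the vocab_size passed to the tokenizer trainer below
print(config.vocab_size)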

And this was used to train the custom tokenizer:

Some people have mentioned keeping the order of the special tokens the same.

from tokenizers import ByteLevelBPETokenizer

# Initialize a tokenizer
tokenizer = ByteLevelBPETokenizer()

# Customize training
tokenizer.train(
    files=[
        "/home/data/BPE/txt_data/72148_tokens.txt",
        "/home/data/BPE/txt_data/70551_tokens.txt",
        "/home/data/BPE/txt_data/70553_tokens.txt",
        "/home/data/BPE/txt_data/78452_tokens.txt",
        "/home/data/BPE/txt_data/74177_tokens.txt",
        "/home/data/BPE/txt_data/71260_tokens.txt",
        "/home/data/BPE/txt_data/71250_tokens.txt",
    ],
    vocab_size=30522,
    min_frequency=10,
    special_tokens=[
        "<s>",
        "<pad>",
        "</s>",
        "<unk>",
        "<mask>",
    ],
)

# Save files to disk
tokenizer.save("/home/data/BPE/", "30k_v3_roberta")
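With this older tokenizers save signature, I believe the call above writes 30k_v3_roberta-vocab.json and 30k_v3_roberta-merges.txt. This is roughly how I then point RobertaTokenizer at them and check the special-token order (file names are assumed from the save prefix above):

from transformers import RobertaTokenizer

# file names assumed from the prefix passed to tokenizer.save above
tokenizer = RobertaTokenizer(
    vocab_file="/home/data/BPE/30k_v3_roberta-vocab.json",
    merges_file="/home/data/BPE/30k_v3_roberta-merges.txt",
)

# the ids should line up with the order the special tokens were passed in during training
print(tokenizer.convert_tokens_to_ids(["<s>", "<pad>", "</s>", "<unk>", "<mask>"]))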
