
I am trying to fine-tune GPT-2 on the task of sequence continuation: given five consecutive numbers, produce the next consecutive numbers. For example, if input_text = "one | two | three | four | five" then output_text = "six | seven... | ten".
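For reference, a minimal sketch of how such pairs could be built (the num2words package and the make_pair helper are my own assumptions for illustration, not part of the actual data pipeline):

from num2words import num2words  # hypothetical choice for number-to-words conversion

def make_pair(start: int):
    # five consecutive numbers as the prompt, the next five as the target
    words = [num2words(n) for n in range(start, start + 10)]
    return " | ".join(words[:5]), " | ".join(words[5:])

print(make_pair(1))
# ('one | two | three | four | five', 'six | seven | eight | nine | ten')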

The relevant parts of the model, which I use through the huggingface API, look like this:

from typing import List, Tuple

import torch
import pytorch_lightning as pl
from transformers import GPT2LMHeadModel


class Model(pl.LightningModule):
    def __init__(self, 
                 tokenizer, 
                 lr: float) -> None:
        super().__init__()
        self.lr = lr
        self.tokenizer = Tokenizer(tokenizer)
        self.model = GPT2LMHeadModel.from_pretrained('gpt2')
        
    def common_step(self, batch: Tuple[List[str], List[str]]) -> torch.FloatTensor:
        questions, answers = batch
        combined = [input + " <EOS> " + output for input, output in zip(questions, answers)]
        tokens = {k: v.to(self.device) for k, v in self.tokenizer(combined).items()}
        
        labels = tokens["input_ids"].clone()
        labels[tokens["attention_mask"]==0] = -100

        outputs = self.model(
            input_ids=tokens["input_ids"], 
            attention_mask=tokens["attention_mask"],
            labels=labels, 
            return_dict=True
        )
        
        return outputs["loss"]
    
    def training_step(self, batch: Tuple[List[str], List[str]], *args) -> torch.FloatTensor:
        loss = self.common_step(batch)
        return loss
        
    def generate_examples(self, batch):
        questions, answers = batch
        combined = [question + " <EOS> " for question in questions]
        tokens = {k: v.to(self.device) for k, v in self.tokenizer(combined).items()}

        generated = self.model.generate(
            input_ids=tokens["input_ids"], 
            attention_mask=tokens["attention_mask"], 
        )

        print(questions[0])
        print("="*30)
        print(self.tokenizer.decode(generated[0]))

The output I get does try to spit out numbers, but unfortunately it looks like the following: the actual continuation only starts well past the point where the label should begin, and up to there the model just repeats the prompt. Note that the GPT-2 tokenizer has no pad token by default:

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
<|endoftext|>five thousand, five hundred and ninety-one| five thousand, five hundred and ninety-two| five thousand, five hundred and ninety-three| five thousand, five hundred and ninety-four| five thousand, five hundred and ninety-five <EOS> <|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|> fifteen thousand, four hundred and thirty-six| ten thousand, six hundred and sixty-seven| fifteen thousand and sixty‑eight| 15 thousand and eighty-nine| fifteen hundred and seventy<|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|><|endoftext|>

So the question is: why does it only produce a plausible candidate after a whole run of <|endoftext|> tokens? In the training set the input and output are joined by the word "<EOS>" (which is not an actual special token), and the output follows immediately, without any padding in between.
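One thing I noticed while poking at this (my own guess at the mechanism, not something confirmed): when a batch of prompts of unequal length is tokenized with padding, GPT2Tokenizer pads on the right with <|endoftext|> by default, so generate can only continue after that block of padding. A minimal sketch, assuming the Tokenizer wrapper pads the batch the same way the plain GPT2Tokenizer does:

from transformers import GPT2Tokenizer

tok = GPT2Tokenizer.from_pretrained("gpt2")
tok.pad_token = tok.unk_token  # pad == unk == eos == 50256, as in the setup below

batch = ["one | two | three | four | five <EOS> ",
         "five thousand, five hundred and ninety-one | ... <EOS> "]
enc = tok(batch, padding=True, return_tensors="pt")
# the shorter prompt is right-padded with <|endoftext|> (id 50256),
# which is exactly where generation would have to pick up for that row
print(tok.decode(enc["input_ids"][0]))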

Could this be related to the tokenizer I use, which I set up as follows?

from transformers import GPT2Tokenizer

# make sure GPT-2 adds BOS at the beginning and EOS at the end of every encoded sequence
def build_inputs_with_special_tokens(self, token_ids_0, token_ids_1=None):
    outputs = [self.bos_token_id] + token_ids_0 + [self.eos_token_id]
    return outputs
    
GPT2Tokenizer.build_inputs_with_special_tokens = build_inputs_with_special_tokens
gpt2_tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
# set pad_token_id to unk_token_id -> be careful here as unk_token_id == eos_token_id == bos_token_id
gpt2_tokenizer.pad_token = gpt2_tokenizer.unk_token
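To illustrate what that patch does (a quick check of my own, not from the original notebook): every encoded string now starts and ends with <|endoftext|>, because bos_token_id == eos_token_id == 50256 for GPT-2.

enc = gpt2_tokenizer("one | two | three | four | five <EOS> ")
ids = enc["input_ids"]
print(ids[0], ids[-1])             # both 50256, i.e. <|endoftext|>
print(gpt2_tokenizer.decode(ids))  # <|endoftext|>one | two | ... <EOS> <|endoftext|>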

A working example can be found in this colab.
