我正在运行一个微调的 BERT 和 ALBERT 模型来进行问答。而且,我正在评估这些模型在SQuAD v2.0的一部分问题上的表现。我使用SQuAD 的官方评估脚本进行评估。
我使用 Huggingface transformers
,在下面您可以找到我正在运行的实际代码和示例(可能对一些尝试在 SQuAD v2.0 上运行 ALBERT 微调模型的人也有帮助):
tokenizer = AutoTokenizer.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")
model = AutoModelForQuestionAnswering.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")
question = "Why aren't the examples of bouregois architecture visible today?"
text = """Exceptional examples of the bourgeois architecture of the later periods were not restored by the communist authorities after the war (like mentioned Kronenberg Palace and Insurance Company Rosja building) or they were rebuilt in socialist realism style (like Warsaw Philharmony edifice originally inspired by Palais Garnier in Paris). Despite that the Warsaw University of Technology building (1899\u20131902) is the most interesting of the late 19th-century architecture. Some 19th-century buildings in the Praga district (the Vistula\u2019s right bank) have been restored although many have been poorly maintained. Warsaw\u2019s municipal government authorities have decided to rebuild the Saxon Palace and the Br\u00fchl Palace, the most distinctive buildings in prewar Warsaw."""
input_dict = tokenizer.encode_plus(question, text, return_tensors="pt")
input_ids = input_dict["input_ids"].tolist()
start_scores, end_scores = model(**input_dict)
all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1]).replace('▁', '')
print(answer)
输出如下:
[CLS] why aren ' t the examples of bour ego is architecture visible today ? [SEP] exceptional examples of the bourgeois architecture of the later periods were not restored by the communist authorities after the war
如您所见,答案中有 BERT 的特殊标记,包括[CLS]
和[SEP]
。
我知道在答案只是[CLS]
(有两个tensor(0)
forstart_scores
和end_scores
)的情况下,这基本上意味着模型认为在上下文中没有对问题的答案是有意义的。在这些情况下,我只是在运行评估脚本时将该问题的答案设置为空字符串。
但我想知道在上面的例子中,我是否应该再次假设模型找不到答案并将答案设置为空字符串?或者我应该在评估模型性能时留下这样的答案?
我问这个问题是因为据我了解,如果我有这样的案例作为答案,使用评估脚本计算的性能可能会发生变化(如果我错了,请纠正我)并且我可能无法真正了解这些模型。