
I am running fine-tuned BERT and ALBERT models for question answering, and I am evaluating these models on a subset of the questions from SQuAD v2.0. I use SQuAD's official evaluation script for the evaluation.

I use Huggingface transformers; below you can find the actual code and an example I am running (it might also be helpful for people trying to run a fine-tuned ALBERT model on SQuAD v2.0):

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")
model = AutoModelForQuestionAnswering.from_pretrained("ktrapeznikov/albert-xlarge-v2-squad-v2")

question = "Why aren't the examples of bouregois architecture visible today?"
text = """Exceptional examples of the bourgeois architecture of the later periods were not restored by the communist authorities after the war (like mentioned Kronenberg Palace and Insurance Company Rosja building) or they were rebuilt in socialist realism style (like Warsaw Philharmony edifice originally inspired by Palais Garnier in Paris). Despite that the Warsaw University of Technology building (1899\u20131902) is the most interesting of the late 19th-century architecture. Some 19th-century buildings in the Praga district (the Vistula\u2019s right bank) have been restored although many have been poorly maintained. Warsaw\u2019s municipal government authorities have decided to rebuild the Saxon Palace and the Br\u00fchl Palace, the most distinctive buildings in prewar Warsaw."""

input_dict = tokenizer.encode_plus(question, text, return_tensors="pt")
input_ids = input_dict["input_ids"].tolist()
start_scores, end_scores = model(**input_dict)

all_tokens = tokenizer.convert_ids_to_tokens(input_ids[0])
answer = ' '.join(all_tokens[torch.argmax(start_scores) : torch.argmax(end_scores)+1]).replace('▁', '')
print(answer)

The output looks like this:

[CLS] why aren ' t the examples of bour ego is architecture visible today ? [SEP] exceptional examples of the  bourgeois architecture of the later periods were not restored by the communist authorities after the war

As you can see, the answer contains BERT special tokens, including [CLS] and [SEP].

I understand that in cases where the answer is just [CLS] (i.e. torch.argmax returns tensor(0) for both start_scores and end_scores), it basically means the model thinks there is no answer to the question in the context. In those cases I simply set the answer for that question to an empty string when running the evaluation script.
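For reference, this is roughly the fallback I use for the [CLS] case (a minimal sketch; span_to_answer is just a helper name I made up here, while all_tokens, start_scores and end_scores come from the snippet above):

import torch

def span_to_answer(all_tokens, start_scores, end_scores):
    start = torch.argmax(start_scores).item()
    end = torch.argmax(end_scores).item()
    #both logits peaking at position 0 ([CLS]) is treated as "no answer"
    if start == 0 and end == 0:
        return ""
    return ' '.join(all_tokens[start : end + 1]).replace('▁', '')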

What I would like to know is: in the example above, should I again assume that the model could not find an answer and set the answer to an empty string, or should I leave such an answer as it is when evaluating the models' performance?

I am asking because, as far as I understand, the performance computed with the evaluation script may change if I keep such cases as answers (correct me if I'm wrong), and then I would not get a realistic picture of these models.


1 Answer


You should simply treat them as invalid, because you are trying to predict the correct answer span from the variable text. Everything else should be invalid. This is also how huggingface deals with these predictions:

We could hypothetically create invalid predictions, e.g., predict that the start of the span is in the question. We throw out all invalid predictions.
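As a minimal sketch of that rule (is_valid_span is just an illustrative helper name; it assumes the tokenizer returns token_type_ids that are 0 for the question and special tokens and 1 for the context passage, which is the case for the tokenizer used above):

def is_valid_span(input_dict, start_idx, end_idx, max_answer_length=30):
    token_type_ids = input_dict["token_type_ids"][0]
    #an end before the start or an overly long span is invalid
    if start_idx > end_idx or end_idx - start_idx + 1 > max_answer_length:
        return False
    #spans that start or end outside the context passage (e.g. in the question) are invalid
    if token_type_ids[start_idx] == 0 or token_type_ids[end_idx] == 0:
        return False
    return True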

You should also note that they use a more sophisticated method to get the predictions for each question (don't ask me why they show torch.argmax in their example). Please have a look at the example below:

from transformers.data.processors.squad import SquadResult, SquadExample, SquadFeatures, SquadV2Processor, squad_convert_examples_to_features
from transformers.data.metrics.squad_metrics import compute_predictions_logits, squad_evaluate

###
#your example code
###

outputs = model(**input_dict)

def to_list(tensor):
    return tensor.detach().cpu().tolist()

output = [to_list(output[0]) for output in outputs]
start_logits, end_logits = output

all_results = []
#SquadResult(unique_id, start_logits, end_logits); the unique_id must match the one assigned to the feature (squad_convert_examples_to_features starts counting at 1000000000)
all_results.append(SquadResult(1000000000, start_logits, end_logits))

#this is the answers section from the evaluation dataset
answers = [{'text':'not restored by the communist authorities', 'answer_start':77}, {'text':'were not restored', 'answer_start':72}, {'text':'not restored by the communist authorities after the war', 'answer_start':77}]

#SquadExample(qas_id, question_text, context_text, answer_text, start_position_character, title, answers, is_impossible)
examples = [SquadExample('0', question, text, 'not restored by the communist authorities', 75, 'Warsaw', answers, False)]

#this does basically the same as tokenizer.encode_plus() but stores the result in SquadFeatures objects and splits the context into multiple features if necessary
#positional arguments: max_seq_length=512, doc_stride=100, max_query_length=64, is_training=True
features = squad_convert_examples_to_features(examples, tokenizer, 512, 100, 64, True)

predictions = compute_predictions_logits(
            examples,
            features,
            all_results,
            20,                     #n_best_size
            30,                     #max_answer_length
            True,                   #do_lower_case
            'pred.file',            #output_prediction_file
            'nbest_file',           #output_nbest_file
            'null_log_odds_file',   #output_null_log_odds_file
            False,                  #verbose_logging
            True,                   #version_2_with_negative
            0.0,                    #null_score_diff_threshold
            tokenizer
            )

result = squad_evaluate(examples, predictions)

print(predictions)
for x in result.items():
  print(x)

Output:

OrderedDict([('0', 'communist authorities after the war')])
('exact', 0.0)
('f1', 72.72727272727273)
('total', 1)
('HasAns_exact', 0.0)
('HasAns_f1', 72.72727272727273)
('HasAns_total', 1)
('best_exact', 0.0)
('best_exact_thresh', 0.0)
('best_f1', 72.72727272727273)
('best_f1_thresh', 0.0)