
I am currently developing a question answering system with BERT for my thesis (in Indonesian). The dataset and the questions asked are in Indonesian.

The problem is that I am still not clear on how to develop a question answering system with BERT step by step.

Based on what I concluded after reading several research journals and papers, the process might go like this:

  1. Prepare the main dataset
  2. Load the pre-trained data
  3. Train the main dataset with the pre-trained data (so that it produces a "fine-tuned" model)
  4. Cluster the fine-tuned model
  5. Testing (asking the system questions)
  6. Evaluation
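
For what it's worth, here is roughly how I picture steps 2 and 3 in code (just a sketch with Hugging Face transformers and a toy one-example dataset; the checkpoint name and the answer start/end token indices are placeholders I made up, not something I have verified):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer

# Step 2: load a pre-trained checkpoint (placeholder name; any BERT checkpoint would fit here)
checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

# Toy stand-in for the main dataset (step 1): QA fine-tuning needs, per example,
# the tokenized question + context plus the answer's start/end token positions.
class ToyQADataset(torch.utils.data.Dataset):
    def __init__(self):
        enc = tokenizer(
            "Siapa presiden pertama Indonesia?",
            "Presiden pertama Indonesia adalah Soekarno.",
            truncation=True, padding="max_length", max_length=64,
        )
        enc["start_positions"] = 10  # placeholder token index of the answer start
        enc["end_positions"] = 11    # placeholder token index of the answer end
        self.examples = [enc]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return {k: torch.tensor(v) for k, v in self.examples[i].items()}

# Step 3: fine-tune the pre-trained model on the main dataset
args = TrainingArguments(output_dir="qa-finetuned", num_train_epochs=1, per_device_train_batch_size=1)
Trainer(model=model, args=args, train_dataset=ToyQADataset()).train()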

What I want to ask is:

  • Are these steps correct? Or are there any steps I am missing?
  • Also, if the default pre-trained data provided for BERT is in English while my main dataset is in Indonesian, how can I create my own Indonesian pre-trained data?
  • Is it really necessary to perform data/model clustering with BERT?

I would appreciate any helpful answer. Thank you very much in advance.


1 Answer


I would take a look at the Hugging Face question answering examples. That is at least a good place to start.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    with torch.no_grad():  # inference only, no gradient tracking needed
        outputs = model(**inputs)
    # transformers v4+ returns a ModelOutput object rather than a plain tuple
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    answer_start = torch.argmax(
        answer_start_scores
    )  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
answered 2021-07-24T16:43:27.467