
I am currently developing a question answering system with BERT for my thesis (in Indonesian). The dataset and the questions asked are in Indonesian.

The problem is that I am still not clear on how to develop a question answering system with BERT step by step.

Based on what I concluded after reading several research journals and papers, the process might go like this:

  1. Prepare the main dataset
  2. Load the pre-trained data
  3. Train the main dataset with the pre-trained data (so that it produces a "fine-tuned" model)
  4. Cluster the fine-tuned model
  5. Testing (asking the system questions)
  6. Evaluation
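
For what it's worth, here is roughly how I picture steps 2 and 3 in code (just a sketch with Hugging Face transformers and a toy one-example dataset; the checkpoint name and the answer start/end token indices are placeholders I made up, not something I have verified):

import torch
from transformers import AutoTokenizer, AutoModelForQuestionAnswering, TrainingArguments, Trainer

# Step 2: load a pre-trained checkpoint (placeholder name; any BERT checkpoint would fit here)
checkpoint = "bert-base-multilingual-cased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

# Toy stand-in for the main dataset (step 1): QA fine-tuning needs, per example,
# the tokenized question + context plus the answer's start/end token positions.
class ToyQADataset(torch.utils.data.Dataset):
    def __init__(self):
        enc = tokenizer(
            "Siapa presiden pertama Indonesia?",
            "Presiden pertama Indonesia adalah Soekarno.",
            truncation=True, padding="max_length", max_length=64,
        )
        enc["start_positions"] = 10  # placeholder token index of the answer start
        enc["end_positions"] = 11    # placeholder token index of the answer end
        self.examples = [enc]

    def __len__(self):
        return len(self.examples)

    def __getitem__(self, i):
        return {k: torch.tensor(v) for k, v in self.examples[i].items()}

# Step 3: fine-tune the pre-trained model on the main dataset
args = TrainingArguments(output_dir="qa-finetuned", num_train_epochs=1, per_device_train_batch_size=1)
Trainer(model=model, args=args, train_dataset=ToyQADataset()).train()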

What I want to ask is:

  • Are these steps correct? Or are there any steps I am missing?
  • Also, if the default pre-trained data provided for BERT is in English while my main dataset is in Indonesian, how can I create my own Indonesian pre-trained data?
  • Is it really necessary to perform data/model clustering with BERT?

I would appreciate any helpful answer. Thank you very much in advance.


1 Answer


I would take a look at the Hugging Face question answering examples. That is at least a good place to start.

from transformers import AutoTokenizer, AutoModelForQuestionAnswering
import torch

tokenizer = AutoTokenizer.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")
model = AutoModelForQuestionAnswering.from_pretrained("bert-large-uncased-whole-word-masking-finetuned-squad")

text = r"""
 Transformers (formerly known as pytorch-transformers and pytorch-pretrained-bert) provides general-purpose
architectures (BERT, GPT-2, RoBERTa, XLM, DistilBert, XLNet…) for Natural Language Understanding (NLU) and Natural
Language Generation (NLG) with over 32+ pretrained models in 100+ languages and deep interoperability between
TensorFlow 2.0 and PyTorch.
"""

questions = [
    "How many pretrained models are available in Transformers?",
    "What does Transformers provide?",
    "Transformers provides interoperability between which frameworks?",
]

for question in questions:
    inputs = tokenizer(question, text, add_special_tokens=True, return_tensors="pt")
    input_ids = inputs["input_ids"].tolist()[0]

    with torch.no_grad():  # inference only, no gradient tracking needed
        outputs = model(**inputs)
    # transformers v4+ returns a ModelOutput object rather than a plain tuple
    answer_start_scores = outputs.start_logits
    answer_end_scores = outputs.end_logits

    answer_start = torch.argmax(
        answer_start_scores
    )  # Get the most likely beginning of answer with the argmax of the score
    answer_end = torch.argmax(answer_end_scores) + 1  # Get the most likely end of answer with the argmax of the score

    answer = tokenizer.convert_tokens_to_string(tokenizer.convert_ids_to_tokens(input_ids[answer_start:answer_end]))

    print(f"Question: {question}")
    print(f"Answer: {answer}\n")
answered 2021-07-24T16:43:27.467