python - BERT input format for next sentence prediction in Python NLP

Question

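I am trying to train a BERT model to predict the correct next utterance. I am given a disentangled conversation, and I am trying to select the next utterance from a pool of 100 candidates that may not contain the correct next utterance. I am trying to create a model trained on data in this input format: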

{
    "data-split": "train",
    "example-id": 0,
    "messages-so-far": [
        {
            "date": "2007-02-13",
            "speaker": "participant_0",
            "time": "07:31",
            "utterance": "hi guys, i need some urgent help. i \"rm -rf'd\" a direcotry. any way i can recover it?"
        },
        {
            "date": "2007-02-13",
            "speaker": "participant_1",
            "time": "07:31",
            "utterance": "participant_0 : in short, no."
        },
        {
            "date": "2007-02-13",
            "speaker": "participant_0",
            "time": "07:31",
            "utterance": "participant_1 , are you sure?"
        },
        ...
    ],
    "options-for-correct-answers": [
        {
            "candidate-id": "3d06877cb2f0c1861b248860fa60ce07",
            "speaker": "participant_1",
            "utterance": "\"Are you sure?\" is something rm -rf never asks.."
        }
    ],
    "options-for-next": [
        {
            "candidate-id": "ace962b708d559fc462b7fdd9b6fc093",
            "speaker": "participant_1",
            "utterance": "(and if hardware is detected correctly, of course)"
        },
        {
            "candidate-id": "349efca9c3d5986a87d95fb90c1b7c04",
            "speaker": "participant_2",
            "utterance": "how do i do a simulated reboot"
        },
        ...
    ],
    "scenario": 1
}

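The messages-so-far field contains the context of the conversation, and options-for-next contains the candidates from which the next utterance is chosen. The correct next utterance is given in the options-for-correct-answers field. The scenario field refers to the subtask.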
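
To make the relationship between these fields concrete, here is a minimal sketch that marks each candidate in options-for-next as correct (1) or incorrect (0) by matching its candidate-id against options-for-correct-answers. The helper name `label_candidates`, the short ids `c1`/`c2`, and the toy record are made up for illustration; only the field names come from the data above.

```python
def label_candidates(record):
    """Return (utterance, label) pairs for every candidate in one record.

    label is 1 when the candidate-id also appears in
    options-for-correct-answers, otherwise 0. (Hypothetical helper.)
    """
    correct_ids = {c["candidate-id"] for c in record["options-for-correct-answers"]}
    return [
        (cand["utterance"], int(cand["candidate-id"] in correct_ids))
        for cand in record["options-for-next"]
    ]


# Toy record with invented candidate ids, not taken from the real file.
record = {
    "options-for-correct-answers": [
        {"candidate-id": "c2", "utterance": "in short, no."}
    ],
    "options-for-next": [
        {"candidate-id": "c1", "utterance": "how do i do a simulated reboot"},
        {"candidate-id": "c2", "utterance": "in short, no."},
    ],
}

labeled = label_candidates(record)
```

If options-for-correct-answers is empty, every candidate simply receives label 0, which covers the case where the pool does not contain the correct next utterance.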

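What format should I put this data into? It is currently JSON. I know it needs to be a TSV file, but I am having trouble figuring out what the columns should contain.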

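I wrote code that puts it into this format

But I don't think this is what I want.

For reference, here is the code that processes it into that format. Any suggestions on how to change it so that I can write the data to a TSV file for BERT training would be great!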

import json

import pandas as pd

file_path = "/Users/madison/Desktop/Final 1671/NOESIS-II/subtask1/data/task-1.advising.train.json"

# Load the list of dialogue records from the JSON file.
with open(file_path) as json_file:
    records = json.load(json_file)

example_id = []
last_sentence = []
next_sentence = []

for row in records:
    example_id.append(row['example-id'])
    # Use only the most recent utterance as the context.
    last_sentence.append(row['messages-so-far'][-1]['utterance'])

    # If no correct answer is listed, store the placeholder string "None".
    if len(row['options-for-correct-answers']) != 0:
        next_sentence.append(row['options-for-correct-answers'][0]['utterance'])
    else:
        next_sentence.append("None")

data = {"example_id": example_id, "last_sentence": last_sentence, "next_sentence": next_sentence}
df = pd.DataFrame(data)

print(df.head())
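
For comparison, one possible sentence-pair layout is a TSV with one row per (last utterance, candidate) pair and a 0/1 label. This is only a sketch under assumptions: the column names, the helper `write_pairs_tsv`, and the toy records are invented here, and nothing in the question confirms that a particular BERT training script expects exactly these columns. It uses the standard-library `csv` module with a tab delimiter.

```python
import csv
import io


def write_pairs_tsv(records, out):
    """Write one TSV row per candidate: example_id, label, last_sentence, candidate.

    Hypothetical layout; adjust the columns to whatever the training script reads.
    """
    writer = csv.writer(out, delimiter="\t", lineterminator="\n")
    writer.writerow(["example_id", "label", "last_sentence", "candidate"])
    for row in records:
        correct_ids = {c["candidate-id"] for c in row["options-for-correct-answers"]}
        last = row["messages-so-far"][-1]["utterance"]
        for cand in row["options-for-next"]:
            label = int(cand["candidate-id"] in correct_ids)
            writer.writerow([row["example-id"], label, last, cand["utterance"]])


# Toy records with invented candidate ids, standing in for the real JSON file.
toy_records = [
    {
        "example-id": 0,
        "messages-so-far": [{"utterance": "are you sure?"}],
        "options-for-correct-answers": [{"candidate-id": "c2", "utterance": "in short, no."}],
        "options-for-next": [
            {"candidate-id": "c1", "utterance": "how do i do a simulated reboot"},
            {"candidate-id": "c2", "utterance": "in short, no."},
        ],
    }
]

buffer = io.StringIO()
write_pairs_tsv(toy_records, buffer)
tsv_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, pass a file opened with `open(path, "w", newline="")` as `out`. Unlike the code above, this keeps the incorrect candidates as negative examples rather than storing only the correct utterance.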
