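I am trying to train a BERT model to predict the correct next utterance. I am given a disentangled conversation and need to select the next utterance from a pool of 100 candidates, which may not contain the correct next utterance. I am trying to build a model trained on input data like this: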
{
  "data-split": "train",
  "example-id": 0,
  "messages-so-far": [
    {
      "date": "2007-02-13",
      "speaker": "participant_0",
      "time": "07:31",
      "utterance": "hi guys, i need some urgent help. i \"rm -rf'd\" a direcotry. any way i can recover it?"
    },
    {
      "date": "2007-02-13",
      "speaker": "participant_1",
      "time": "07:31",
      "utterance": "participant_0 : in short, no."
    },
    {
      "date": "2007-02-13",
      "speaker": "participant_0",
      "time": "07:31",
      "utterance": "participant_1 , are you sure?"
    },
    ...
  ],
  "options-for-correct-answers": [
    {
      "candidate-id": "3d06877cb2f0c1861b248860fa60ce07",
      "speaker": "participant_1",
      "utterance": "\"Are you sure?\" is something rm -rf never asks.."
    }
  ],
  "options-for-next": [
    {
      "candidate-id": "ace962b708d559fc462b7fdd9b6fc093",
      "speaker": "participant_1",
      "utterance": "(and if hardware is detected correctly, of course)"
    },
    {
      "candidate-id": "349efca9c3d5986a87d95fb90c1b7c04",
      "speaker": "participant_2",
      "utterance": "how do i do a simulated reboot"
    },
    ...
  ],
  "scenario": 1
}
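The messages-so-far field contains the context of the conversation, and options-for-next contains the candidates from which the next utterance is to be selected. The correct next utterance is given in the options-for-correct-answers field. The scenario field refers to the subtask.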
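To make the field layout concrete, here is a minimal sketch of how one record could be expanded into labeled (context, candidate) pairs. The function name `record_to_pairs`, the choice to join all context utterances with spaces, and the 0/1 label are assumptions for illustration only. It also assumes that the correct candidate, when present, appears in options-for-next with the same candidate-id.

```python
def record_to_pairs(record):
    """Expand one record into (example_id, context, candidate, label) tuples."""
    # Assumption: the whole context is joined into a single string.
    context = " ".join(m["utterance"] for m in record["messages-so-far"])
    # IDs of the correct candidates; this set is empty when the pool
    # does not contain a correct next utterance.
    correct_ids = {c["candidate-id"] for c in record["options-for-correct-answers"]}
    pairs = []
    for cand in record["options-for-next"]:
        label = 1 if cand["candidate-id"] in correct_ids else 0
        pairs.append((record["example-id"], context, cand["utterance"], label))
    return pairs


# Hand-made record in the same shape as the sample (hypothetical values).
sample = {
    "example-id": 0,
    "messages-so-far": [
        {"utterance": "any way i can recover it?"},
        {"utterance": "in short, no."},
    ],
    "options-for-correct-answers": [{"candidate-id": "b", "utterance": "ok, thanks"}],
    "options-for-next": [
        {"candidate-id": "a", "utterance": "how do i do a simulated reboot"},
        {"candidate-id": "b", "utterance": "ok, thanks"},
    ],
}
pairs = record_to_pairs(sample)
```

With this framing, a record whose options-for-correct-answers list is empty simply yields only label-0 pairs.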
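What format should I put this data in? It is currently in JSON format. I know it needs to be a TSV file, but I am struggling to figure out what the columns should contain.
I have written code that puts it into this format
but I don't think this is what I want.
For reference, here is the code that processes it into that format. Any suggestions on how to change it so that I can write the data to a TSV file for BERT training would be great!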
import json

import pandas as pd

file_path = "/Users/madison/Desktop/Final 1671/NOESIS-II/subtask1/data/task-1.advising.train.json"

# Load the list of training records.
with open(file_path) as json_file:
    records = json.load(json_file)

example_id = []
last_sentence = []
next_sentence = []

for row in records:
    example_id.append(row['example-id'])
    # Keep only the final utterance of the context.
    last_sentence.append(row['messages-so-far'][-1]['utterance'])
    # Some examples have no correct answer in the candidate pool.
    if len(row['options-for-correct-answers']) != 0:
        next_sentence.append(row['options-for-correct-answers'][0]['utterance'])
    else:
        next_sentence.append("None")

data = {"example_id": example_id, "last_sentence": last_sentence, "next_sentence": next_sentence}
df = pd.DataFrame(data)
print(df.head())
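For illustration, one possible way to write rows to a TSV file is sketched below using the standard `csv` module. The column names (`example_id`, `context`, `candidate`, `label`) and the helper `write_tsv` are hypothetical. Which columns a given BERT training script actually expects depends on that script, so treat this layout as an assumption rather than a required format.

```python
import csv
import io


def write_tsv(rows, handle):
    """Write (example_id, context, candidate, label) rows as tab-separated text."""
    # csv.writer quotes any field that contains a tab or newline,
    # so such characters do not break the column layout.
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(["example_id", "context", "candidate", "label"])
    writer.writerows(rows)


# Hypothetical rows; in practice they would be built from the JSON records.
rows = [
    (0, "participant_1 , are you sure?", "how do i do a simulated reboot", 0),
    (0, "participant_1 , are you sure?", "ok, thanks", 1),
]

buffer = io.StringIO()
write_tsv(rows, buffer)
tsv_text = buffer.getvalue()
```

To write to disk instead of an in-memory buffer, the same function can be passed a file opened with `open(path, "w", newline="")`.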