I'm new to BERT and I'm trying to modify the output of run_squad.py to build a question answering system. I want to obtain an output file with the following structure:
{
  "data": [
    {
      "id": "ID1",
      "title": "Alan_Turing",
      "question": "When Alan Turing was born?",
      "context": "Alan Mathison Turing (23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. [...] . However, both Julius and Ethel wanted their children to be brought up in Britain, so they moved to Maida Vale, London, where Alan Turing was born on 23 June 1912, as recorded by a blue plaque on the outside of the house of his birth, later the Colonnade Hotel. Turing had an elder brother, John (the father of Sir John Dermot Turing, 12th Baronet of the Turing baronets).",
      "answers": [
        {"text": "on 23 June 1912", "probability": 0.891726, "start_logit": 4.075, "end_logit": 4.15},
        {"text": "on 23 June", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
        {"text": "June 1912", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
      ]
    },
    {
      "id": "ID2",
      "title": "Title2",
      "question": "Question2",
      "context": "Context 2 ...",
      "answers": [
        {"text": "text1", "probability": 0.891726, "start_logit": 4.075, "end_logit": 4.15},
        {"text": "text2", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
        {"text": "text3", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
      ]
    }
  ]
}
First of all, in the read_squad_examples function (line 227 of run_squad.py), BERT reads a SQuAD JSON file (the input file) into a list of SquadExample objects. This file contains the first four fields that I need: id, title, question and context.
Afterwards, the SquadExample objects are converted to features, and then the write_predictions phase (line 741) can start.
In write_predictions, BERT writes an output file called nbest_predictions.json, which contains the candidate answers for each question, each with an associated probability.
On lines 891-898 I think the last four fields that I need (text, probability, start_logit, end_logit) are appended:
nbest_json = []
for (i, entry) in enumerate(nbest):
  output = collections.OrderedDict()
  output["text"] = entry.text
  output["probability"] = probs[i]
  output["start_logit"] = entry.start_logit
  output["end_logit"] = entry.end_logit
  nbest_json.append(output)
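For reference, the probs list used in this loop is, as far as I can tell from run_squad.py, a softmax over the sum start_logit + end_logit of every entry in the n-best list. A minimal sketch of that computation follows. Here compute_softmax is my own stand-in for the script's helper, and the logit pairs are just the illustrative numbers from the top of this post, not real model output:

```python
import math


def compute_softmax(scores):
    """Numerically stable softmax over a list of floats."""
    if not scores:
        return []
    max_score = max(scores)
    # Subtracting the maximum avoids overflow in exp() without changing the result.
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (start_logit, end_logit) pairs for three n-best candidates.
candidates = [(4.075, 4.15), (2.075, 1.15), (1.075, 0.854)]

# Each candidate is scored by the sum of its start and end logits.
total_scores = [start + end for start, end in candidates]
probs = compute_softmax(total_scores)

for (start, end), prob in zip(candidates, probs):
    print(start, end, round(prob, 6))
```

Since the softmax runs over the whole n-best list (its size is controlled by the n_best_size flag), the probabilities of the few answers shown for one id need not sum to 1.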
The output file nbest_predictions.json has the following structure:
{
  "ID-1": [
    {
      "text": "text1",
      "probability": 0.3617,
      "start_logit": 4.0757,
      "end_logit": 4.1554
    }, {
      "text": "text2",
      "probability": 0.0036,
      "start_logit": -0.5180,
      "end_logit": 4.1554
    }
  ],
  "ID-2": [
    {
      "text": "text1",
      "probability": 0.2487,
      "start_logit": -1.6009,
      "end_logit": -0.2818
    }, {
      "text": "text2",
      "probability": 0.0070,
      "start_logit": -0.9566,
      "end_logit": -1.5770
    }
  ]
}
Now, I don't fully understand how the nbest_predictions.json file is generated. How can I edit this function to obtain a JSON file structured as shown at the beginning of my post?
Thinking it over, I see two possibilities:
- Create a new data structure and append the fields that I need.
- Edit the write_predictions function so that nbest_predictions.json is structured the way I want.
Which is the better solution?
So far I have written a new function that reads the input file and appends my id, title, question and context to a data structure:
import json
import tensorflow as tf


def read_squad_examples2(input_file, is_training):
  """Read a SQuAD JSON file into a JSON string of id/title/question/context records."""
  # is_training is unused; it only mirrors the signature of read_squad_examples.
  with tf.gfile.Open(input_file, "r") as reader:
    input_data = json.load(reader)["data"]

  sup_data = []
  for entry in input_data:
    for paragraph in entry["paragraphs"]:
      for qa in paragraph["qas"]:
        # Build a fresh dict for every question. Reusing a single dict would
        # make every list element refer to the same (last) record.
        data = {
            "id": qa["id"],
            "title": entry["title"],
            "question": qa["question"],
            "context": paragraph["context"],
        }
        sup_data.append(data)
  return json.dumps(sup_data)
What I get is:
[{
  "question": "Question 1?",
  "id": "ID 1 ",
  "context": "The context 1",
  "title": "Title 1"
}, {
  "question": "Question 2?",
  "id": "ID 2 ",
  "context": "The context 2",
  "title": "Title 2"
}]
At this point, how can I append the answers field, containing "text", "probability", "start_logit" and "end_logit", to this data structure?
Thanks in advance.
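One way this could be approached, sketched under assumptions rather than tested against run_squad.py: parse the JSON string returned by read_squad_examples2 back into a list, load nbest_predictions.json into a dict keyed by question id, and attach the top candidates to each record. The names attach_answers and top_k are hypothetical, and the sample values below are made up to mirror the structures shown in this post:

```python
import json


def attach_answers(records, nbest, top_k=3):
    """Return {"data": [...]} with an "answers" list added to each record.

    records: list of dicts with at least an "id" key.
    nbest:   dict mapping question id -> list of candidate answer dicts.
    """
    merged = []
    for record in records:
        item = dict(record)  # shallow copy so the input list is left untouched
        # An id with no predictions gets an empty answers list.
        item["answers"] = nbest.get(record["id"], [])[:top_k]
        merged.append(item)
    return {"data": merged}


# Illustrative stand-ins for the two inputs.
records = json.loads(
    '[{"id": "ID-1", "title": "Title 1", '
    '"question": "Question 1?", "context": "The context 1"}]'
)
nbest = {
    "ID-1": [
        {"text": "text1", "probability": 0.3617,
         "start_logit": 4.0757, "end_logit": 4.1554},
        {"text": "text2", "probability": 0.0036,
         "start_logit": -0.5180, "end_logit": 4.1554},
    ]
}

result = attach_answers(records, nbest)
print(json.dumps(result, indent=2))
```

In practice, records would presumably come from json.loads(read_squad_examples2(...)) and nbest from json.load on the nbest_predictions.json file. This keeps write_predictions untouched, which corresponds to the first of the two possibilities listed above. The ids in both inputs must match exactly for the lookup to succeed.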