I'm new to BERT and I'm trying to edit the output of run_squad.py to build a Question Answering system and obtain an output file with the following structure:

{
    "data": [
      {
            "id": "ID1",
            "title": "Alan_Turing",
            "question": "When Alan Turing was born?",
            "context": "Alan Mathison Turing (23 June 1912 – 7 June 1954) was an English mathematician, computer scientist, logician, cryptanalyst, philosopher and theoretical biologist. [...] . However, both Julius and Ethel wanted their children to be brought up in Britain, so they moved to Maida Vale, London, where Alan Turing was born on 23 June 1912, as recorded by a blue plaque on the outside of the house of his birth, later the Colonnade Hotel. Turing had an elder brother, John (the father of Sir John Dermot Turing, 12th Baronet of the Turing baronets).",
            "answers": [
              {"text": "on 23 June 1912",   "probability": 0.891726, "start_logit": 4.075,  "end_logit": 4.15},
              {"text": "on 23 June", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
              {"text": "June 1912", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
            ]
        },
        {
            "id": "ID2",
            "title": "Title2",
            "question": "Question2",
            "context": "Context 2 ...",
            "answers": [
              {"text": "text1", "probability": 0.891726, "start_logit": 4.075, "end_logit": 4.15},
              {"text": "text2", "probability": 0.091726, "start_logit": 2.075, "end_logit": 1.15},
              {"text": "text3", "probability": 0.051726, "start_logit": 1.075, "end_logit": 0.854}
            ]
        }
    ]
}

First of all, in the read_squad_examples function (line 227 of run_squad.py), BERT reads a SQuAD json file (the input file) into a list of SquadExample objects. This file contains the first four fields that I need: id, title, question and context.

Afterwards, the SquadExamples are converted to features, and then the write_predictions phase (line 741) can start.

In write_predictions, BERT writes an output file called nbest_predictions.json, which contains the candidate answers for each question, each with an associated probability.

On lines 891-898 I think the last four fields that I need (text, probability, start_logit, end_logit) are appended:

    nbest_json = []
    for (i, entry) in enumerate(nbest):
      output = collections.OrderedDict()
      output["text"] = entry.text
      output["probability"] = probs[i]
      output["start_logit"] = entry.start_logit
      output["end_logit"] = entry.end_logit
      nbest_json.append(output)
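
As far as I can tell from the code just before this loop, probs is a softmax over the summed scores start_logit + end_logit of the n-best entries. Below is a minimal standalone sketch of that step. The compute_softmax function is my own re-implementation for illustration, not the helper from run_squad.py, and the logit values are made up:

```python
import math


def compute_softmax(scores):
    """Numerically stable softmax over a list of raw scores."""
    if not scores:
        return []
    max_score = max(scores)
    # Subtract the maximum before exponentiating to avoid overflow.
    exps = [math.exp(s - max_score) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical (start_logit, end_logit) pairs for two n-best candidates.
nbest_logits = [(4.0757, 4.1554), (-0.5180, 4.1554)]
total_scores = [start + end for (start, end) in nbest_logits]
probs = compute_softmax(total_scores)
```

With only two candidates the probabilities sum to 1 over those two; in the real file the softmax runs over all n-best entries for a question, so the individual values are smaller.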

The output file nbest_predictions.json has the following structure:

{
    "ID-1": [
        {
            "text": "text1", 
            "probability": 0.3617, 
            "start_logit": 4.0757, 
            "end_logit": 4.1554
        }, {
            "text": "text2", 
            "probability": 0.0036, 
            "start_logit": -0.5180, 
            "end_logit": 4.1554
        }
    ], 
    "ID-2": [
        {
            "text": "text1", 
            "probability": 0.2487, 
            "start_logit": -1.6009, 
            "end_logit": -0.2818
        }, {
            "text": "text2", 
            "probability": 0.0070, 
            "start_logit": -0.9566, 
            "end_logit": -1.5770
        }
    ]
}

Now... I don't exactly understand how the nbest_predictions file is generated. How can I edit this function to obtain a json file structured as I indicated at the beginning of my post?

Thinking about it, I see two possibilities:

  1. Create a new data structure and append the fields that I need.
  2. Edit the write_predictions function so that nbest_predictions.json is structured the way I want.

Which is the better solution?

Currently, I have written a new function that reads the input file and appends my id, title, question and context to a data structure:

import json
import tensorflow as tf


def read_squad_examples2(input_file, is_training):
  # Read a SQuAD json file into a JSON string with one record per question.
  # (is_training is unused here; kept to mirror read_squad_examples.)
  with tf.gfile.Open(input_file, "r") as reader:
    input_data = json.load(reader)["data"]

  sup_data = []

  for entry in input_data:
    for paragraph in entry["paragraphs"]:
      for qa in paragraph["qas"]:
        # Build a fresh dict per question; reusing a single dict would make
        # every list element point to the same (last) record.
        data = {
            "title": entry["title"],
            "context": paragraph["context"],
            "id": qa["id"],
            "question": qa["question"],
        }
        sup_data.append(data)

  return json.dumps(sup_data)

What I get is:

[{
    "question": "Question 1?",
    "id": "ID 1 ",
    "context": "The context 1",
    "title": "Title 1"
}, {
    "question": "Question 2?",
    "id": "ID 2 ",
    "context": "The context 2",
    "title": "Title 2"
}]

At this point, how can I append the "answers" field, containing "text", "probability", "start_logit" and "end_logit", to this data structure?
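
One approach I am considering for possibility 1 is to leave run_squad.py untouched and join the two files on the question id afterwards. This is only a rough sketch under my own assumptions: merge_predictions and top_k are names I made up, and the small in-memory dicts stand in for the real input file and nbest_predictions.json:

```python
import json


def merge_predictions(squad_data, nbest, top_k=3):
    """Combine SQuAD-format input with n-best predictions, keyed by question id.

    squad_data: the "data" list of a SQuAD json file.
    nbest: dict mapping question id -> list of answer dicts
           (the structure of nbest_predictions.json).
    """
    merged = []
    for entry in squad_data:
        for paragraph in entry["paragraphs"]:
            for qa in paragraph["qas"]:
                merged.append({
                    "id": qa["id"],
                    "title": entry["title"],
                    "question": qa["question"],
                    "context": paragraph["context"],
                    # Ids without predictions get an empty answer list.
                    "answers": nbest.get(qa["id"], [])[:top_k],
                })
    return {"data": merged}


# Tiny in-memory example standing in for the two files.
squad_data = [{
    "title": "Title 1",
    "paragraphs": [{
        "context": "The context 1",
        "qas": [{"id": "ID-1", "question": "Question 1?"}],
    }],
}]
nbest = {
    "ID-1": [
        {"text": "text1", "probability": 0.3617,
         "start_logit": 4.0757, "end_logit": 4.1554},
        {"text": "text2", "probability": 0.0036,
         "start_logit": -0.5180, "end_logit": 4.1554},
    ],
}
result = merge_predictions(squad_data, nbest)
print(json.dumps(result, indent=4))
```

With the real files, squad_data would come from json.load on the input file (its "data" key) and nbest from json.load on nbest_predictions.json.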

Thanks in advance.
