python - How to convert CoNLL 2003 format to JSON format?

Question

I have a list of sentences, where each sentence is a nested list of its words. For example:

[['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
 ['Peter', 'Blackburn'],
 ['BRUSSELS', '1996-08-22']]

I also have a second list that holds the entity tag for each word. For example:

[['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'],
 ['B-PER', 'I-PER'],
 ['B-LOC', 'O']]

This is the basic CoNLL 2003 data, but I am actually working with different data in another language. I am only using this as an illustration.

I want to convert these nested lists to JSONL format, like this:

{"text": "EU rejects German call to boycott British lamb.", "labels": [ [0, 2, "ORG"], [11, 17, "MISC"], ... ]}
{"text": "Peter Blackburn", "labels": [ [0, 15, "PERSON"] ]}
{"text": "President Obama", "labels": [ [10, 15, "PERSON"] ]}
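For reference, here is a minimal sketch of one way to go straight from the two nested lists to records of this shape. It is not a doccano API: the function name `to_jsonl_record` is made up for illustration, and it assumes the text is the tokens joined by single spaces and that every tag is either `O` or of the form `B-X` / `I-X`. A stray `I-X` with no preceding span of the same type is treated as the start of a new span.

```python
import json

def to_jsonl_record(tokens, tags):
    """Build {"text", "labels"} with one [start, end, type] span per entity."""
    text = " ".join(tokens)
    labels = []
    cursor = 0          # character offset of the current token in `text`
    open_entity = None  # type of the span currently being extended, if any
    for token, tag in zip(tokens, tags):
        start, end = cursor, cursor + len(token)
        cursor = end + 1  # skip the single joining space
        prefix, _, entity = tag.partition("-")
        if prefix == "I" and open_entity == entity:
            labels[-1][1] = end                  # extend the open span
        elif prefix in ("B", "I"):
            labels.append([start, end, entity])  # start a new span
            open_entity = entity
        else:
            open_entity = None                   # "O" closes any open span
    return {"text": text, "labels": labels}

sentences = [['EU', 'rejects', 'German', 'call', 'to', 'boycott', 'British', 'lamb', '.'],
             ['Peter', 'Blackburn'],
             ['BRUSSELS', '1996-08-22']]
tags = [['B-ORG', 'O', 'B-MISC', 'O', 'O', 'O', 'B-MISC', 'O', 'O'],
        ['B-PER', 'I-PER'],
        ['B-LOC', 'O']]

# One JSON object per line; ensure_ascii=False keeps non-ASCII text readable.
lines = [json.dumps(to_jsonl_record(s, t), ensure_ascii=False)
         for s, t in zip(sentences, tags)]
print("\n".join(lines))
```

Because the tokens are joined with spaces, the final period ends up as " ." rather than attached to the previous word, so the offsets refer to that space-joined text.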

So far, I have managed to get the nested lists into this format (a JSON list of dicts):

[{'id': 0,
  'text': 'Corina Casanova , İsviçre Federal Şansölyesidir .',
  'labels': [[0, 6, 'B-Person'],
   [7, 15, 'I-Person'],
   [18, 25, 'B-Country'],
   [26, 33, 'B-Misc'],
   [34, 47, 'I-Misc']]},
 {'id': 1,
  'text': "Casanova , İsviçre Federal Yüksek Mahkemesi eski Başkanı , Nay Giusep'in pratiğinde bir avukat olarak çalıştı .",
  'labels': [[0, 8, 'B-Person'],
   [11, 18, 'B-Misc'],
   [19, 26, 'I-Misc'],
   [27, 33, 'I-Misc'],
   [34, 43, 'I-Misc'],
   [59, 62, 'B-Person'],
   [63, 72, 'I-Person']]}]

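The problem, however, is that I want to merge the IOB tags so that each entity becomes a single span from its start to its end. I need this format to upload the data to the doccano annotation tool, with multi-word entities labeled as one.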

Here is the code I wrote to create the format above:

list_json = []

for x, i in enumerate(sentences[0:2]):
    list_json.append({"id": x})
    list_json[x]["text"] = " ".join(i)
    list_json[x]["labels"] = []
    for y, j in enumerate(labels[x]):
        if j in ['B-Person', 'I-Person', 'B-Country'...(private data)]:
            word = i[y]
            wordStartIndex = list_json[x]["text"].find(word)
            wordEndIndex = list_json[x]["text"].index(word) + len(word)
            list_json[x]["labels"].append([wordStartIndex, wordEndIndex, j])
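One caveat with this loop: `str.find` and `str.index` return the first occurrence of the word, so a word that appears twice in a sentence gets the offsets of its first occurrence both times. A minimal sketch of tracking a running offset instead, assuming the text is the tokens joined by single spaces (the helper name `token_offsets` is hypothetical):

```python
def token_offsets(tokens):
    """Return the (start, end) character offsets of each token in " ".join(tokens)."""
    offsets = []
    cursor = 0
    for token in tokens:
        start = cursor
        end = start + len(token)
        offsets.append((start, end))
        cursor = end + 1  # skip the single joining space
    return offsets

print(token_offsets(['Peter', 'Blackburn']))
```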

I tried to convert that list-of-dicts format into the one I want, i.e., to merge the IOB tags. Here is what I have tried so far, which did not work:

new_labels = []

for y, i in enumerate(list_json):
    label_names = [item[2] for item in i["labels"]]
    label_BIO = [item[0] for item in label_names]
    k = 0
    for index in range(len(label_BIO)-1):
        
        if (label_BIO[index] == "B" and label_BIO[index+1] == "I") or (label_BIO[index] == "I" and label_BIO[index+1] == "I"):
            k += 1
    
    for x in range(len(i["labels"])-1):
        
        
        if i["labels"][x][2][0] == "B" and i["labels"][x+1][2][0] == "I":
            new_labels.append([i["labels"][x][0],i["labels"][x+k-1][1],i["labels"][x][2][2:]])
                
        elif i["labels"][x][2][0] != "I" and i["labels"][x+1][2][0] != "I":
            new_labels.append([i["labels"][x][0], i["labels"][x][1], i["labels"][x][2]])

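The problem with this code is that I cannot determine the length of each consecutive sequence, so k stays the same for every element in the list. I need k to change for the next sequence in the same list.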

Here is the error I get:

IndexError                                Traceback (most recent call last)
<ipython-input-93-420750229f93> in <module>
---> 19             new_labels.append([i["labels"][x][0],i["labels"][x+k-1][1],i["labels"][x][2][2:]])
     20 
     21         elif i["labels"][x][2][0] != "I" and i["labels"][x+1][2][0] != "I":

IndexError: list index out of range

I need to work out where to compute k each time. Here k is the length of a sequence in which a B is followed by one or more I tags.

I also tried this, but it only merges two labels at a time:

new_labels = []

for y, i in enumerate(list_json):
    I_labels = []
    for x, j in reversed(list(enumerate(i["labels"]))):
        if j[2][0] == "I" and i["labels"][x-1][2][2:] == j[2][2:]:
            new_labels.append([i["labels"][x-1][0],j[1],j[2][2:]])
        elif j[2][0] != "I" and i["labels"][x+1][2][0] != "I":
            new_labels.append([j[0], j[1], j[2]])

Output:

[[26, 47, 'Misc'],
 [18, 25, 'Country'],
 [0, 15, 'Person'],
 [59, 72, 'Person'],
 [27, 43, 'Misc'],
 [19, 33, 'Misc'],
 [11, 26, 'Misc'],
 [0, 8, 'Person']]

But I need the three "Misc" labels as a single label spanning indices 11 to 43.
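A minimal sketch of one way to get that result from the per-token labels, without counting k up front: walk the labels in order and extend the last merged span whenever an `I-` tag of the same type directly follows it. The function name `merge_iob_spans` is hypothetical. It assumes every tag has the form `B-X` or `I-X`, and that consecutive tokens are separated by exactly one space, which is how an `O` token dropped from the label list is detected.

```python
def merge_iob_spans(token_labels):
    """Merge per-token [start, end, 'B-X' / 'I-X'] labels into [start, end, 'X'] spans."""
    merged = []
    for start, end, tag in token_labels:
        prefix, _, entity = tag.partition("-")
        continues_previous = (
            prefix == "I"
            and merged
            and merged[-1][2] == entity
            and start == merged[-1][1] + 1  # adjacent in the text (single space)
        )
        if continues_previous:
            merged[-1][1] = end  # stretch the open span to cover this token
        else:
            merged.append([start, end, entity])
    return merged

labels = [[0, 8, 'B-Person'], [11, 18, 'B-Misc'], [19, 26, 'I-Misc'],
          [27, 33, 'I-Misc'], [34, 43, 'I-Misc'],
          [59, 62, 'B-Person'], [63, 72, 'I-Person']]
print(merge_iob_spans(labels))
# [[0, 8, 'Person'], [11, 43, 'Misc'], [59, 72, 'Person']]
```

Since each span grows as its tokens arrive, the run length never has to be known in advance, and no index beyond the current item is read.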

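For anyone wondering: the reason I am doing this is that I have already labeled some data and tested a prototype model, and it seems to give good results. So I want to pre-label the whole dataset with it and fix the wrong labels, instead of annotating from scratch. I think this will save me a lot of time.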

PS: I know doccano supports uploading in CoNLL format, but that is broken, so I cannot upload the data that way.
