I had a similar task recently, and this is how I did it:
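import spacy

# Assumes the small English pipeline is installed: python -m spacy download en_core_web_sm
nlp = spacy.load('en_core_web_sm')

def text_to_doccano(text):
    """
    :text (str): source text
    Returns (list (dict)): doccano format json
    """
    djson = list()
    doc = nlp(text)
    for sent in doc.sents:
        labels = list()
        for e in sent.ents:
            # Entity offsets are document-level; shift them so they index into sent.text
            labels.append([e.start_char - sent.start_char, e.end_char - sent.start_char, e.label_])
        djson.append({'text': sent.text, "labels": labels})
    return djson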
Based on your example...
text = "Test text that should be annotated for Michael Schumacher."
djson = text_to_doccano(text)
print(djson)
...this prints:
[{'text': 'Test text that should be annotated for Michael Schumacher.', 'labels': [[39, 57, 'PERSON']]}]
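A quick way to sanity-check the offsets is to slice each labelled span back out of its text. This is only a minimal sketch: `labeled_spans` is a hypothetical helper name of mine, not part of spaCy or doccano, and it assumes the `[start, end, label]` layout shown above.

```python
def labeled_spans(entries):
    """Return (substring, label) pairs for every label in doccano-style entries."""
    spans = []
    for entry in entries:
        for start, end, label in entry["labels"]:
            # Slice the labelled span out of the entry's own text
            spans.append((entry["text"][start:end], label))
    return spans


example = [{'text': 'Test text that should be annotated for Michael Schumacher.',
            'labels': [[39, 57, 'PERSON']]}]
print(labeled_spans(example))  # [('Michael Schumacher', 'PERSON')]
```

If a slice does not match the entity you expect, the offsets are probably relative to the whole document rather than to the sentence.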
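On a related note, when you save the result to a file, the standard way of writing JSON with json.dump will not work, because it writes the whole list as a single JSON array with comma-separated entries. AFAIK, doccano expects one entry per line, with no trailing commas. To get around this, the following snippet works like a charm: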
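import json
# Write one JSON object per line (JSONL); filepath is the output path you choose
with open(filepath, 'w', encoding='utf-8') as f:
    f.write("\n".join(json.dumps(e) for e in djson))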
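If it helps, here is the same idea wrapped in a pair of small helpers with a round-trip check. Treat it as a sketch under assumptions: `write_jsonl` and `read_jsonl` are hypothetical names of mine, the temporary file path is just for illustration, and unlike the one-liner above this version ends the file with a trailing newline.

```python
import json
import os
import tempfile


def write_jsonl(entries, filepath):
    """Write each entry as one JSON object per line."""
    with open(filepath, "w", encoding="utf-8") as f:
        for entry in entries:
            # ensure_ascii=False keeps non-ASCII text readable in the file
            f.write(json.dumps(entry, ensure_ascii=False) + "\n")


def read_jsonl(filepath):
    """Read a JSONL file back into a list of dicts, skipping blank lines."""
    with open(filepath, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


entries = [
    {"text": "Test text that should be annotated for Michael Schumacher.",
     "labels": [[39, 57, "PERSON"]]},
    {"text": "A second sentence without entities.", "labels": []},
]

# Illustrative location only; use whatever path you import into doccano from
path = os.path.join(tempfile.mkdtemp(), "doccano.jsonl")
write_jsonl(entries, path)
assert read_jsonl(path) == entries
```

Reading the file back like this is a cheap way to confirm that every line is valid JSON on its own before you try the import.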
/Cheers