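I am trying to convert a dataset to .spacy by first converting each record to a doc and then adding it to a DocBin. The whole dataset file is accessible via GoogleDocs.
I run the following function: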
import spacy
from spacy.tokens import DocBin
from tqdm import tqdm


def converter(data, outputFile):
    nlp = spacy.blank("en")  # create a blank English pipeline
    doc_bin = DocBin()  # create a DocBin object
    for text, annot in tqdm(data):  # data in previous format
        doc = nlp.make_doc(text)  # create doc object from text
        ents = []
        for start, end, label in annot["entities"]:  # add character indexes
            # supported modes: strict, contract, expand
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            # to avoid having the traceback;
            # TypeError: object of type 'NoneType' has no len()
            if span is None:
                pass
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the ents
        doc_bin.add(doc)
    doc_bin.to_disk(f"./{outputFile}.spacy")  # save the docbin object
    return f"Processed {len(doc_bin)}"
After running the function on the dataset, I get this traceback:
ValueError: [E1010] Unable to set entity information for token 27 which is included in more than one span in entities, blocked, missing or outside.
After looking closely at the dataset file to find the text that raises this traceback, I found the following:
[('HereLongText..(abstract)',
  {'entities': [('0', '27', 'SpecificDisease'),
                ('80', '93', 'SpecificDisease'),
                ('260', '278', 'SpecificDisease'),
                ('615', '628', 'SpecificDisease'),
                ('673', '691', 'SpecificDisease'),
                ('754', '772', 'SpecificDisease')]})]
I don't know how to solve this problem.
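One check that might help narrow this down, assuming (and this is only a guess) that the error comes from annotations whose character ranges overlap somewhere in the dataset. The helper name find_overlaps is hypothetical and not part of spaCy. It casts the offsets to int because they appear as strings in the sample above.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) annotations whose character ranges overlap.

    Hypothetical helper for inspecting the data; it does not use spaCy.
    """
    # Offsets in the sample are strings, so cast them to int before comparing.
    spans = sorted((int(start), int(end), label) for start, end, label in entities)
    overlaps = []
    for i, (s1, e1, l1) in enumerate(spans):
        for s2, e2, l2 in spans[i + 1:]:
            if s2 >= e1:
                # Spans are sorted by start, so no later span can overlap this one.
                break
            overlaps.append(((s1, e1, l1), (s2, e2, l2)))
    return overlaps


# The entity list from the record shown above.
sample = [
    ("0", "27", "SpecificDisease"),
    ("80", "93", "SpecificDisease"),
    ("260", "278", "SpecificDisease"),
    ("615", "628", "SpecificDisease"),
    ("673", "691", "SpecificDisease"),
    ("754", "772", "SpecificDisease"),
]
sample_overlaps = find_overlaps(sample)
```

Running find_overlaps on annot["entities"] for every record in data would list any records with conflicting ranges. If overlaps do turn out to be the cause, spaCy's spacy.util.filter_spans can drop overlapping Span objects before they are assigned to doc.ents, though I have not confirmed that this is what triggers E1010 here.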