python - ValueError: Unable to set entity information for token 27 which is included in more than one span in entities

Question

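I am trying to convert a dataset to the .spacy format by first converting it to a doc and then to a DocBin. The entire dataset file is accessible via GoogleDocs.

I run the following function: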

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def converter(data, outputFile):
    nlp = spacy.blank("en") # load a new spacy model
    doc_bin = DocBin() # create a DocBin object

    for text, annot in tqdm(data): # data in previous format
        doc = nlp.make_doc(text) # create doc object from text    
        ents = []
        
        for start, end, label in annot["entities"]: # add character indexes
            # supported modes: strict, contract, expand
            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            # to avoid having the traceback; 
            # TypeError: object of type 'NoneType' has no len()
            if span is None:
                pass
            else:
                ents.append(span)
        doc.ents = ents # label the text with the ents
        doc_bin.add(doc)
        
    doc_bin.to_disk(f"./{outputFile}.spacy") # save the docbin object
    return f"Processed {len(doc_bin)}"

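After running the function on the dataset, I get this traceback: ValueError: [E1010] Unable to set entity information for token 27 which is included in more than one span in entities, blocked, missing or outside.

After looking closely at the dataset file to find the text that raises this traceback, I found the following: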

[('HereLongText..(abstract)',
  {'entities': [('0', '27', 'SpecificDisease'),
    ('80', '93', 'SpecificDisease'),
    ('260', '278', 'SpecificDisease'),
    ('615', '628', 'SpecificDisease'),
    ('673', '691', 'SpecificDisease'),
    ('754', '772', 'SpecificDisease')]})]

I don't know how to solve this problem.

score 1 · Accepted Answer

I think this should make your problem clear. Here is a slightly modified version of your code that produces the same error.

import spacy
from spacy.tokens import DocBin
from tqdm import tqdm

def converter(data, outputFile):
    nlp = spacy.blank("en")  # load a new spacy model
    doc_bin = DocBin()  # create a DocBin object

    for text, annot in tqdm(data):  # data in previous format
        doc = nlp.make_doc(text)  # create doc object from text
        ents = []

        for start, end, label in annot["entities"]:  # add character indexes
            # supported modes: strict, contract, expand

            span = doc.char_span(start, end, label=label, alignment_mode="strict")
            # to avoid having the traceback;
            # TypeError: object of type 'NoneType' has no len()
            if span is None:
                pass
            else:
                ents.append(span)
        doc.ents = ents  # label the text with the ents
        doc_bin.add(doc)

    doc_bin.to_disk(f"./{outputFile}.spacy")  # save the docbin object
    return f"Processed {len(doc_bin)}"


data = [("I like cheese", 
    {"entities": [
        (0, 1, "Sample"),
        (0, 1, "Sample"), # Same thing twice
        ]})]

converter(data, "out.txt")

Note that in the example there are two annotations for exactly the same span. If you remove one of them, you will no longer get the error.
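Removing duplicates by hand does not scale to a whole dataset. One possible approach, sketched below, is to pass the collected spans through `spacy.util.filter_spans` before assigning them to `doc.ents`. The sample text and the third annotation `(7, 13, "Sample")` are made up for illustration and are not from the question's dataset.

```python
import spacy
from spacy.util import filter_spans

nlp = spacy.blank("en")
doc = nlp.make_doc("I like cheese")

# Illustrative annotations: the first two are exact duplicates.
spans = [
    doc.char_span(0, 1, label="Sample"),
    doc.char_span(0, 1, label="Sample"),  # same span twice
    doc.char_span(7, 13, label="Sample"),
]

# filter_spans drops duplicates and overlaps, preferring the longest span
# (and the earliest one when lengths are equal).
doc.ents = filter_spans(spans)

kept = [(ent.start_char, ent.end_char, ent.label_) for ent in doc.ents]
print(kept)
```

In the `converter` function this would amount to writing `doc.ents = filter_spans(ents)` instead of `doc.ents = ents`. Keep in mind that this silently discards annotations, so correcting the source data may be the better fix if the overlaps are labeling mistakes.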

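You are probably getting the error because your annotations overlap, which is not valid: each token can belong to at most one entity in doc.ents.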
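To find which annotations collide before building the DocBin, a small check on the raw character offsets can help. The following is a minimal sketch in plain Python; `find_overlaps` is a hypothetical helper name, and the `int()` conversion is an assumption to cope with offsets stored as strings, as they appear in the question's data.

```python
def find_overlaps(entities):
    """Return pairs of (start, end, label) annotations whose ranges overlap."""
    # Normalize offsets to int in case they were stored as strings.
    spans = sorted((int(start), int(end), label) for start, end, label in entities)
    overlaps = []
    for i, first in enumerate(spans):
        for second in spans[i + 1:]:
            if second[0] >= first[1]:
                break  # sorted by start, so no later span can overlap `first`
            overlaps.append((first, second))
    return overlaps


# Illustrative data: the duplicated annotation from the answer's example.
sample = [(0, 1, "Sample"), (0, 1, "Sample"), (7, 13, "Sample")]
print(find_overlaps(sample))
```

Running this over every `annot["entities"]` list in the dataset should point to the texts whose annotations need to be fixed or filtered.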
