
I am trying to convert a spaCy NER dataset to the Flair format with the following code:

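import spacy
from spacy.gold import biluo_tags_from_offsets  # spaCy v2 import path

nlp = spacy.load("en_core_web_md")

# TRAIN_DATA is a list of (text, {"entities": [(start, end, label), ...]}) pairs
ents = TRAIN_DATA

with open("flair_ner.txt", "w", encoding="utf-8") as f:
    for sent, tags in ents:
        doc = nlp(sent)
        # One BILUO tag per token, derived from the character offsets
        biluo = biluo_tags_from_offsets(doc, tags['entities'])
        for word, tag in zip(doc, biluo):
            f.write(f"{word} {tag}\n")
        # A blank line separates sentences in the output file
        f.write("\n")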

I get an overlap error:

ValueError: [E103] Trying to set conflicting doc.ents: '(1155, 1199, 'Email Address')' and '(1143, 1240, 'Links')'. A token can only be part of one entity, so make sure the entities you're setting don't overlap.

Here is a sample of the data:

[('Afreen Jamadar\nActive member of IIIT Committee in Third year\n\nSangli, Maharashtra - Email me on Indeed: indeed.com/r/Afreen-Jamadar/8baf379b705e37c6\n\nI wish to use my knowledge, skills and conceptual understanding to create excellent team\nenvironments and work consistently achieving organization objectives believes in taking initiative\nand work to excellence in my work.\n\nWORK EXPERIENCE\n\nActive member of IIIT Committee in Third year\n\nCisco Networking -  Kanpur, Uttar Pradesh\n\norganized by Techkriti IIT Kanpur and Azure Skynet.\nPERSONALLITY TRAITS:\n• Quick learning ability\n• hard working\n\nEDUCATION\n\nPG-DAC\n\nCDAC ACTS\n\n2017\n\nBachelor of Engg in Information Technology\n\nShivaji University Kolhapur -  Kolhapur, Maharashtra\n\n2016\n\nSKILLS\n\nDatabase (Less than 1 year), HTML (Less than 1 year), Linux. (Less than 1 year), MICROSOFT\nACCESS (Less than 1 year), MICROSOFT WINDOWS (Less than 1 year)\n\nADDITIONAL INFORMATION\n\nTECHNICAL SKILLS:\n\n• Programming Languages: C, C++, Java, .net, php.\n• Web Designing: HTML, XML\n• Operating Systems: Windows […] Windows Server 2003, Linux.\n• Database: MS Access, MS SQL Server 2008, Oracle 10g, MySql.\n\nhttps://www.indeed.com/r/Afreen-Jamadar/8baf379b705e37c6?isid=rex-download&ikw=download-top&co=IN',
  {'entities': [(1155, 1199, 'Email Address'),
    (1143, 1240, 'Links'),
    (743, 1141, 'Skills'),
    (729, 733, 'Graduation Year'),
    (706, 728, 'Location'),
    (675, 703, 'College Name'),
    (631, 673, 'Degree'),
    (625, 630, 'Graduation Year'),
    (614, 623, 'College Name'),
    (606, 612, 'Degree'),
    (458, 479, 'Location'),
    (438, 454, 'Companies worked at'),
    (104, 148, 'Email Address'),
    (62, 68, 'Location'),
    (0, 14, 'Name')]}),

1 Answer


From Prodigy/spaCy support:

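> The entity recognizer is constrained to predict only non-overlapping, non-nested spans. The training data should obey the same constraint. If you like, you could have two sentences with the different annotations in your data. I'm not sure whether this would hurt or help your performance, though.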

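The error message shows that the Email Address span (start 1155, end 1199) lies inside the Links span (start 1143, end 1240), so the two overlap. You need to resolve the overlapping annotations before your code will run.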
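One possible way to resolve the overlaps, as a rough sketch rather than an official spaCy recipe: filter each entity list before passing it to biluo_tags_from_offsets. The helper name drop_overlapping is hypothetical, and keeping the longest span of each overlapping group is an assumption on my part; you may prefer a different rule, such as keeping Email Address over Links.

```python
def drop_overlapping(entities):
    """Return a non-overlapping subset of (start, end, label) offsets.

    Assumed policy: when spans overlap, the longest one wins.
    """
    # Rank longest first; ties are broken by the earliest start offset.
    ranked = sorted(entities, key=lambda e: (-(e[1] - e[0]), e[0]))
    kept = []
    for start, end, label in ranked:
        # Half-open ranges [start, end) are disjoint when one ends
        # before (or exactly where) the other begins.
        if all(end <= k_start or start >= k_end for k_start, k_end, _ in kept):
            kept.append((start, end, label))
    # Return the survivors in document order.
    return sorted(kept)


# The two conflicting spans from the error message, plus a neighbour.
entities = [
    (1155, 1199, 'Email Address'),
    (1143, 1240, 'Links'),
    (743, 1141, 'Skills'),
]
cleaned = drop_overlapping(entities)
print(cleaned)
```

In the conversion loop you would then call biluo_tags_from_offsets(doc, drop_overlapping(tags['entities'])). Note that this discards the shorter annotation entirely, so check whether losing those labels is acceptable for your dataset.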

Answered 2020-12-17T16:40:23.570