python - How to convert the spaCy NER dataset format to the Flair format?

Question

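I labeled a dataset with dataturks to train spaCy NER and everything works fine. However, I just realized that Flair uses a different format, and I was wondering whether there is a way to convert my "spaCy NER" JSON dataset into the Flair format: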

George N B-PER
Washington N I-PER
went V O
to P O
Washington N B-LOC

whereas spaCy's format looks like this:

[("George Washington went to Washington",
{'entities': [(0, 6,'PER'),(7, 17,'PER'),(26, 36,'LOC')]})]
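As a quick sanity check on the spaCy side, the entity tuples are character offsets into the text, with the end offset exclusive, so slicing the string shows what each annotation covers. The helper name `entity_texts` below is made up for this illustration and is not part of spaCy:

```python
# The sample from the question: (text, {"entities": [(start, end, label), ...]})
sample = ("George Washington went to Washington",
          {"entities": [(0, 6, "PER"), (7, 17, "PER"), (26, 36, "LOC")]})


def entity_texts(text, annotations):
    """Return (surface text, label) for every annotated character span."""
    return [(text[start:end], label)
            for start, end, label in annotations["entities"]]


print(entity_texts(*sample))
# [('George', 'PER'), ('Washington', 'PER'), ('Washington', 'LOC')]
```

Note that "George" and "Washington" are annotated as two separate PER entities here, which affects the tags any converter will produce for them.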

score 1 · Accepted Answer

Flair uses the BILUO scheme, with an empty line between sentences, so you need to use biluo_tags_from_offsets:

import spacy
# spaCy v2 import path; in spaCy v3 the equivalent helper is spacy.training.offsets_to_biluo_tags
from spacy.gold import biluo_tags_from_offsets
nlp = spacy.load("en_core_web_md")

ents = [("George Washington went to Washington",{'entities': [(0, 6,'PER'),(7, 17,'PER'),(26, 36,'LOC')]}),
         ("Uber blew through $1 million a week", {'entities':[(0, 4, 'ORG')]}),
       ]

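# Write one "token tag" pair per line, with a blank line after each sentence.
with open("flair_ner.txt","w") as f:
    for sent,tags in ents:
        doc = nlp(sent)
        # One BILUO tag per spaCy token, derived from the character offsets.
        biluo = biluo_tags_from_offsets(doc,tags['entities'])
        for word,tag in zip(doc, biluo):
            f.write(f"{word} {tag}\n")
        f.write("\n")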

Output:

George U-PER
Washington U-PER
went O
to O
Washington U-LOC

Uber U-ORG
blew O
through O
$ O
1 O
million O
a O
week O
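If you want the B-/I- prefixes shown in the question rather than BILUO, the same idea can be sketched without spaCy. This is only a rough illustration under a strong assumption: tokens are split on whitespace, so punctuation attached to a word (for example "Washington,") will not line up with the entity offsets. The function names `offsets_to_bio` and `to_column_format` are invented for this sketch.

```python
import re


def offsets_to_bio(text, entities):
    """Return (token, tag) pairs using whitespace tokens and BIO tags."""
    rows = []
    for match in re.finditer(r"\S+", text):
        tag = "O"
        for start, end, label in entities:
            # A token belongs to an entity only if it lies fully inside the span.
            if match.start() >= start and match.end() <= end:
                prefix = "B" if match.start() == start else "I"
                tag = f"{prefix}-{label}"
                break
        rows.append((match.group(), tag))
    return rows


def to_column_format(samples):
    """Render samples as 'token tag' lines with a blank line after each sentence."""
    lines = []
    for text, annotations in samples:
        for token, tag in offsets_to_bio(text, annotations["entities"]):
            lines.append(f"{token} {tag}")
        lines.append("")  # sentence separator
    return "\n".join(lines) + "\n"


# Here "George Washington" is annotated as a single PER span (0, 17).
print(to_column_format([
    ("George Washington went to Washington",
     {"entities": [(0, 17, "PER"), (26, 36, "LOC")]}),
]))
```

For real data the spaCy tokenizer used above is the safer choice, since it separates punctuation from words before the offsets are matched.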

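Note that the token and tag columns alone seem to be enough to train NER. If you would like to add POS tags, you need to create a mapping from Universal POS tags to Flair's simplified scheme. For example: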

tag_mapping = {'PROPN':'N','VERB':'V','ADP':'P','NOUN':'N'} # create your own
with open("flair_ner.txt","w") as f:
    for pair in ents:
        sent,tags = pair
        doc = nlp(sent)
        biluo = biluo_tags_from_offsets(doc,tags['entities'])
        try:
            for word,tag in zip(doc, biluo):
                f.write(f"{word} {tag_mapping[word.pos_]} {tag}\n")
#                 f.write(f"{word} {tag_mapping.get(word.pos_,'None')} {tag}\n")
        except KeyError:
            print(f"'{word.pos_}' tag is not defined in tag_mapping")
        f.write("\n")

Output:

'SYM' tag is not defined in tag_mapping
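Since the mapping has to be built by hand, it may help to list every POS tag that is still missing before writing the file, instead of hitting one KeyError at a time. A minimal sketch follows; the helper names and the fallback value "X" are assumptions for illustration, not part of spaCy or Flair:

```python
tag_mapping = {'PROPN': 'N', 'VERB': 'V', 'ADP': 'P', 'NOUN': 'N'}


def find_unmapped_pos(pos_tags, mapping):
    """Return the sorted POS tags that have no entry in the mapping."""
    return sorted(set(pos_tags) - set(mapping))


def simplify_pos(pos, mapping, default="X"):
    """Look up a simplified tag, falling back to a placeholder (hypothetical 'X')."""
    return mapping.get(pos, default)


# POS tags as spaCy might assign them to "Uber blew through $1 million a week"
# (illustrative values, not actual model output).
seen = ['PROPN', 'VERB', 'ADP', 'SYM', 'NUM', 'NUM', 'DET', 'NOUN']
print(find_unmapped_pos(seen, tag_mapping))
print(simplify_pos('SYM', tag_mapping))
```

With the real pipeline you would collect `token.pos_` over all documents and extend `tag_mapping` until the first function returns an empty list.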

score 0 · Accepted Answer

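The main data format used in spaCy v3.0 is a binary format with the .spacy extension; the JSON format is deprecated. To convert a train.spacy file with BILUO annotations into the Flair format, I created a corpus.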

import spacy
from spacy.training import Corpus

nlp = spacy.load("de_core_news_sm")
corpus = Corpus("route/to/train.spacy")

data = corpus(nlp)

# Flair supports BIO and BIOES, see https://github.com/flairNLP/flair/issues/875
def rename_biluo_to_bioes(old_tag):
    # BILUO and BIOES differ only in two prefixes:
    # L (last) becomes E (end), U (unit) becomes S (single).
    if not isinstance(old_tag, str):
        return ""
    if old_tag.startswith("L-"):
        return "E" + old_tag[1:]
    if old_tag.startswith("U-"):
        return "S" + old_tag[1:]
    return old_tag


def generate_corpus():
    corpus = []
    n_ex = 0
    for example in data:
        n_ex += 1
        text = example.text
        doc = nlp(text)
        tags = example.get_aligned_ner()
        # Skip sentences in which any token has no aligned NER tag (None).
        if None in tags:
            pass
        else:
            new_tags = [rename_biluo_to_bioes(tag) for tag in tags]
            for token, tag in zip(doc,new_tags):
                row = token.text +' '+ token.pos_ +' ' +tag + '\n'
                corpus.append(row)
            corpus.append('\n')
    return corpus

def write_file(filepath):
    with open(filepath, 'w', encoding='utf-8') as f:
        corpus = generate_corpus()
        f.writelines(corpus)
        
def main():
    write_file('./data/train.txt')

if __name__ == '__main__':
    main()

I hope it works. It has been a while, though.
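If you would rather have plain BIO tags, as in the example from the question, the renaming step can be adapted. A small sketch, assuming the tags are BILUO strings such as "U-LOC" or "O" with no None values; the function name `biluo_to_bio` is made up here:

```python
def biluo_to_bio(tags):
    """Convert a list of BILUO tags to BIO: L- becomes I-, U- becomes B-."""
    converted = []
    for tag in tags:
        if tag.startswith("L-"):
            converted.append("I-" + tag[2:])
        elif tag.startswith("U-"):
            converted.append("B-" + tag[2:])
        else:
            # B-, I- and O are identical in both schemes.
            converted.append(tag)
    return converted


print(biluo_to_bio(["B-PER", "L-PER", "O", "O", "U-LOC"]))
# ['B-PER', 'I-PER', 'O', 'O', 'B-LOC']
```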
