I am currently trying to add a spell-checking step to one of spaCy's built-in pipelines, specifically 'en_core_web_sm'.
I found a neat component called Contextual Spell Check (contextualSpellCheck), which I have inserted into the pipeline. The problem is that even after I reorder the pipeline to ['tok2vec', 'parser', 'contextual spellchecker', 'tagger', 'attribute_ruler', 'lemmatizer', 'ner'], the tagger and lemmatizer still operate on the original, misspelled tokens.
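To make the setup concrete, here is a minimal, runnable sketch of how a component slots into a spaCy pipeline. `noop_spellcheck` is a hypothetical stand-in for the real contextualSpellCheck component (which is normally attached via its own helper, e.g. `contextualSpellCheck.add_to_pipe(nlp)`); the point of the sketch is that a component receives the Doc built by the components before it, and the Doc's text and tokenization are fixed once created:

```python
import spacy
from spacy.language import Language

# Hypothetical stand-in component for illustration only.
# A spell-check component like contextualSpellCheck typically records its
# suggestion in a custom extension (doc._.outcome_spellCheck) rather than
# rewriting the tokens, so downstream components still see the typo.
@Language.component("noop_spellcheck")
def noop_spellcheck(doc):
    return doc

nlp = spacy.blank("en")          # blank pipeline instead of en_core_web_sm
nlp.add_pipe("noop_spellcheck")  # position can be set with before=/after=
print(nlp.pipe_names)            # ['noop_spellcheck']
```

With a full model, the same `add_pipe(..., after="parser")` mechanism is what produces the reordered pipeline listed above.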
For example:
doc_a = nlp("Income wes $9.4 milion compared to the prior year of $2.7 milion.")
doc_b = nlp("Income was $9.4 milion compared to the prior year of $2.7 milion.")
Both return the correct spell-check result:
print(doc_a._.outcome_spellCheck)
# Income was $9.4 million compared to the prior year of $2.7 million.
print(doc_b._.outcome_spellCheck)
# Income was $9.4 million compared to the prior year of $2.7 million.
However, inspecting the underlying token annotations:
# doc_a with misspelled 'was'. Note lemma is still the original typo 'wes'
print(doc_a.to_json()['tokens'])
# {'id': 1, 'start': 7, 'end': 10, 'tag': 'MD', 'pos': 'AUX', 'morph': 'VerbType=Mod', 'lemma': 'wes', 'dep': 'ROOT', 'head': 1}
# doc_b with correctly spelled 'was'. Correctly lemmatized to 'be'
print(doc_b.to_json()['tokens'])
# {'id': 1, 'start': 7, 'end': 10, 'tag': 'VBD', 'pos': 'AUX', 'morph': 'Mood=Ind|Number=Sing|Person=3|Tense=Past|VerbForm=Fin', 'lemma': 'be', 'dep': 'ROOT', 'head': 1}
How can I ensure that lemmatization is performed on the spell-checked terms?