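When you assign spaCy's (v3.0.5) English language model en_core_web_sm
its own default tokenizer, the pipeline's behavior changes.
You would expect nothing to change, but tokenization silently fails. Why is that?
Code to reproduce: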
import spacy
text = "don't you're i'm we're he's"
# No tokenizer assignment, everything is fine
nlp = spacy.load('en_core_web_sm')
doc = nlp(text)
[t.lemma_ for t in doc]
>>> ['do', "n't", 'you', 'be', 'I', 'be', 'we', 'be', 'he', 'be']
# Default Tokenizer assignment: tokenization, and therefore lemmatization, fails
nlp = spacy.load('en_core_web_sm')
nlp.tokenizer = spacy.tokenizer.Tokenizer(nlp.vocab)
doc = nlp(text)
[t.lemma_ for t in doc]
>>> ["don't", "you're", "i'm", "we're", "he's"]
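To make the difference easier to inspect, here is a minimal sketch that compares three tokenizers on the same text: the one the pipeline was created with, a Tokenizer constructed from the vocab alone, and one rebuilt from the English language defaults. Two assumptions to flag: it uses spacy.blank("en") as a stand-in for en_core_web_sm so it runs without a model download, and the rebuilt tokenizer reflects my understanding of how spaCy v3 wires up the English defaults, not necessarily the exact settings stored with the trained model.

```python
import spacy
from spacy.tokenizer import Tokenizer
from spacy.util import (
    compile_infix_regex,
    compile_prefix_regex,
    compile_suffix_regex,
)

text = "don't you're i'm we're he's"

# Assumption: a blank English pipeline stands in for en_core_web_sm here,
# so the sketch runs without downloading a trained model.
nlp = spacy.blank("en")

# Tokens from the tokenizer the pipeline was created with.
default_tokens = [t.text for t in nlp.tokenizer(text)]

# A Tokenizer built from the vocab alone has no exception rules and no
# prefix/suffix/infix patterns, so it only splits on whitespace.
bare = Tokenizer(nlp.vocab)
bare_tokens = [t.text for t in bare(text)]

# Rebuild a tokenizer from the English language defaults.
defaults = nlp.Defaults
rebuilt = Tokenizer(
    nlp.vocab,
    rules=defaults.tokenizer_exceptions,
    prefix_search=compile_prefix_regex(defaults.prefixes).search,
    suffix_search=compile_suffix_regex(defaults.suffixes).search,
    infix_finditer=compile_infix_regex(defaults.infixes).finditer,
    token_match=defaults.token_match,
    url_match=defaults.url_match,
)
rebuilt_tokens = [t.text for t in rebuilt(text)]

print(default_tokens)
print(bare_tokens)
print(rebuilt_tokens)
```

If this holds for the trained model as well, the second tokenizer in the reproduction is not really the "default" one: it is the same class, but constructed without any of the English rules, which would explain why the contractions are no longer split.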