python - Why can't we get consistent results when stemming/lemmatizing with spaCy?

Question

Here is my Python code:

import spacy
nlp = spacy.load('en')
line = u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'
line = line.lower()
print(' '.join([token.lemma_ for token in nlp(line)]))

The output is:

algorithm ; deterministic algorithm ; adaptive algorithms ; something...

Why isn't the third algorithms converted to 'algorithm'? When I remove the lower() call, I get this:

algorithms ; deterministic algorithms ; adaptive algorithm ; something...

This time the first and second algorithms are not converted. This problem is driving me crazy. How can I fix it so that every word gets lemmatized?

score 2 · Accepted Answer

What version are you using? With lower it works correctly for me:

>>> doc = nlp(u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'.lower())
>>> for word in doc:
...   print(word.text, word.lemma_, word.tag_)
... 
(u'algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'deterministic', u'deterministic', u'JJ')
(u'algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'adaptive', u'adaptive', u'JJ')
(u'algorithms', u'algorithm', u'NN')
(u';', u';', u':')
(u'something', u'something', u'NN')
(u'...', u'...', u'.')

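Without lower, the tagger assigns Algorithms the tag NNP, i.e. a proper noun. That blocks lemmatization, because the model has statistically guessed that the word is a proper noun.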
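To make the mechanism concrete, here is a toy, standard-library-only sketch of a tag-gated lemmatizer. It is not spaCy's implementation: `toy_lemmatize` and its single plural-stripping rule are assumptions made up for illustration, and the tags are passed in by hand rather than predicted by a model.

```python
# Toy illustration only; NOT spaCy's lemmatizer. The function name and the
# single suffix rule are hypothetical.
def toy_lemmatize(text, tag):
    """Return a lemma for `text`, given a Penn Treebank-style tag."""
    if tag in ("NNP", "NNPS"):
        # Proper nouns are treated as names, so the surface form is kept.
        return text
    word = text.lower()
    if tag in ("NN", "NNS") and word.endswith("s") and not word.endswith("ss"):
        # Crude plural stripping for common nouns.
        return word[:-1]
    return word


# Same surface word, different tag, different lemma.
print(toy_lemmatize("Algorithms", "NNP"))  # Algorithms
print(toy_lemmatize("algorithms", "NNS"))  # algorithm
```

The point is only that the lemma depends on the tag as well as the spelling, so a wrong tag guess shows up as a missing lemma.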

If you like, you can set a special-case rule in the tokenizer to tell spaCy that Algorithms is never a proper noun.

import spacy
from spacy.attrs import POS, LEMMA, ORTH, TAG
nlp = spacy.load('en')

# Exact-string rule: always analyze 'Algorithms' as a plural common noun.
nlp.tokenizer.add_special_case(u'Algorithms', [{ORTH: u'Algorithms', LEMMA: u'algorithm', TAG: u'NNS', POS: u'NOUN'}])
doc = nlp(u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...')
for word in doc:
    print(word.text, word.lemma_, word.tag_)

Output:

(u'Algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'Deterministic', u'deterministic', u'JJ')
(u'algorithms', u'algorithm', u'NN')
(u';', u';', u':')
(u'Adaptive', u'adaptive', u'JJ')
(u'algorithms', u'algorithm', u'NNS')
(u';', u';', u':')
(u'Something', u'something', u'NN')
(u'...', u'...', u'.')

The tokenizer.add_special_case function lets you specify how a string should be tokenized and set attributes on each subtoken.
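Special-case rules are keyed on the exact string, so a rule for Algorithms leaves algorithms and ALGORITHMS to the tagger. The sketch below mimics that lookup in plain Python. `SPECIAL_CASES` and `analyze` are hypothetical names, and the fallback tag is supplied by hand where spaCy would use its statistical model.

```python
# Toy model of an exact-string override table; not spaCy internals.
SPECIAL_CASES = {
    "Algorithms": {"lemma": "algorithm", "tag": "NNS"},
}


def analyze(text, guessed_tag):
    """Prefer a special-case entry; otherwise keep the guessed tag."""
    if text in SPECIAL_CASES:  # case-sensitive, exact match
        return dict(SPECIAL_CASES[text], text=text)
    # No rule: the lemma is left as the surface form in this toy.
    return {"text": text, "lemma": text, "tag": guessed_tag}


print(analyze("Algorithms", "NNP")["tag"])  # NNS
print(analyze("ALGORITHMS", "NNP")["tag"])  # NNP
```

If other capitalized forms matter, each would need its own rule.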

score 0 · Accepted Answer

I think syllogism_ explained it better. But here is another way:

from nltk.stem import WordNetLemmatizer

# Requires the WordNet corpus, e.g. via nltk.download('wordnet').
lemma = WordNetLemmatizer()
line = u'Algorithms; Deterministic algorithms; Adaptive algorithms; Something...'.lower().split(';')
# Split each phrase into words, then lemmatize word by word.
line = [a.strip().split(' ') for a in line]
line = [[lemma.lemmatize(x) for x in l1] for l1 in line]
print(line)

Output:

[[u'algorithm'], [u'deterministic', u'algorithm'], [u'adaptive', u'algorithm'], [u'something...']]
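Splitting on spaces leaves punctuation attached, which is why 'something...' comes through unchanged above. A minimal sketch of one way around that is to pull out alphabetic runs with a regular expression before lemmatizing. `lemmatize_phrases` and the stand-in `strip_plural_s` are hypothetical helpers. The stand-in only exists so the sketch runs without WordNet, and `WordNetLemmatizer().lemmatize` could be passed in its place.

```python
import re


def strip_plural_s(word):
    """Hypothetical stand-in lemmatizer: drop one trailing plural 's'."""
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        return word[:-1]
    return word


def lemmatize_phrases(line, lemmatize=strip_plural_s):
    """Split on ';', keep only alphabetic runs, lemmatize each word."""
    phrases = (p.strip() for p in line.lower().split(";"))
    return [[lemmatize(w) for w in re.findall(r"[a-z]+", p)] for p in phrases if p]


line = "Algorithms; Deterministic algorithms; Adaptive algorithms; Something..."
print(lemmatize_phrases(line))
```

Because the lemmatizer is passed in as a callable, the tokenization step can be checked on its own before wiring in NLTK or spaCy.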
