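The lemmatizer needs the correct POS tag to be accurate. If you call WordNetLemmatizer.lemmatize() with its default settings, the default tag is noun; see https://github.com/nltk/nltk/blob/develop/nltk/stem/wordnet.py#L39
To resolve this, always POS-tag your data before lemmatizing, e.g.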
>>> from nltk.stem import WordNetLemmatizer
>>> from nltk import pos_tag, word_tokenize
>>> wnl = WordNetLemmatizer()
>>> sent = 'This is a foo bar sentence'
>>> pos_tag(word_tokenize(sent))
[('This', 'DT'), ('is', 'VBZ'), ('a', 'DT'), ('foo', 'NN'), ('bar', 'NN'), ('sentence', 'NN')]
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()  # first letter of the Penn Treebank tag
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     if not wntag:
...         lemma = word  # no WordNet POS, keep the word unchanged
...     else:
...         lemma = wnl.lemmatize(word, wntag)
...     print(lemma)
...
This
be
a
foo
bar
sentence
Note 'is -> be', i.e.
>>> wnl.lemmatize('is')
'is'
>>> wnl.lemmatize('is', 'v')
'be'
To answer the question using the words from your example:
>>> sent = 'These sentences involves some horsing around'
>>> for word, tag in pos_tag(word_tokenize(sent)):
...     wntag = tag[0].lower()
...     wntag = wntag if wntag in ['a', 'r', 'n', 'v'] else None
...     lemma = wnl.lemmatize(word, wntag) if wntag else word
...     print(lemma)
...
These
sentence
involve
some
horse
around
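One caveat with keying on the first letter of the tag: Penn Treebank adjective tags start with J (JJ, JJR, JJS), so tag[0].lower() yields 'j', never 'a', and adjectives pass through unlemmatized. A minimal sketch of an explicit mapping is below. The names penn_to_wordnet and lemmatize_tagged are hypothetical helpers, not part of NLTK, and the lemmatize argument is assumed to be any callable taking (word, pos), such as wnl.lemmatize.

```python
# Hypothetical helpers (not part of NLTK): map Penn Treebank tags to the
# single-letter POS codes that WordNet uses.
PENN_PREFIX_TO_WORDNET = {
    'JJ': 'a',  # adjectives: JJ, JJR, JJS
    'RB': 'r',  # adverbs: RB, RBR, RBS
    'NN': 'n',  # nouns: NN, NNS, NNP, NNPS
    'VB': 'v',  # verbs: VB, VBD, VBG, VBN, VBP, VBZ
}


def penn_to_wordnet(tag):
    """Return 'a', 'r', 'n' or 'v' for a Penn Treebank tag, else None."""
    return PENN_PREFIX_TO_WORDNET.get(tag[:2].upper())


def lemmatize_tagged(tagged_tokens, lemmatize):
    """Lemmatize (word, tag) pairs; leave words with no WordNet POS unchanged.

    `lemmatize` is assumed to be a callable taking (word, pos),
    for example WordNetLemmatizer().lemmatize.
    """
    lemmas = []
    for word, tag in tagged_tokens:
        wntag = penn_to_wordnet(tag)
        lemmas.append(lemmatize(word, wntag) if wntag else word)
    return lemmas
```

With NLTK and its data installed, lemmatize_tagged(pos_tag(word_tokenize(sent)), wnl.lemmatize) should reproduce the lemmas printed by the loop above for this sentence, while also handing adjectives to WordNet as 'a'.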
Note that WordNetLemmatizer has some quirks.
Also, NLTK's default POS tagger is undergoing some major changes to improve accuracy.
For an out-of-the-box, off-the-shelf lemmatizer, you can take a look at https://github.com/alvations/pywsd, where I added some try-excepts to catch words that are not in WordNet; see https://github.com/alvations/pywsd/blob/master/pywsd/utils.py#L66
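The try-except idea can be sketched roughly as follows. This is not the code in pywsd: safe_lemmatize is a hypothetical name, and the lemmatize argument is assumed to be a callable taking (word, pos), such as wnl.lemmatize. The sketch simply falls back to the original word whenever there is no usable POS, the lookup raises, or the result is empty.

```python
def safe_lemmatize(word, pos, lemmatize):
    """Hypothetical wrapper: return a lemma, or `word` itself on failure.

    `lemmatize` is assumed to be a callable taking (word, pos),
    e.g. WordNetLemmatizer().lemmatize.
    """
    if pos is None:
        # No usable WordNet POS for this token, so leave it as it is.
        return word
    try:
        lemma = lemmatize(word, pos)
    except Exception:
        # Broad on purpose: any lookup failure falls back to the surface form.
        return word
    # Guard against an empty result as well.
    return lemma or word
```

Catching Exception broadly is a design choice for a preprocessing pipeline where one odd token should not abort the whole run; narrow it if you want lookup errors to surface.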