You don't need a Python wrapper for a Java library; nltk has Snowball! :)
>>> from nltk.stem import SnowballStemmer as SS
>>> stemmer = SS('english')
>>> stemmer.stem('dance')
u'danc'
>>> stemmer.stem('danced')
u'danc'
>>> stemmer.stem('dancing')
u'danc'
>>> stemmer.stem('dancer')
u'dancer'
>>> stemmer.stem('accordance')
u'accord'
Stemming won't always give you the exact root, but it's a great start.
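To see which word forms collapse onto the same stem (and which, like 'dancer', stay separate), a small grouping helper can be handy. This is only a sketch: `group_by_stem` is a name I made up, and it assumes `stem` is any callable that maps a word to its stem, such as `SnowballStemmer('english').stem`.

```python
from collections import defaultdict


def group_by_stem(words, stem):
    """Map each stem to the sorted list of distinct words that produce it.

    `stem` is assumed to be a callable str -> str (hypothetical parameter),
    for example SnowballStemmer('english').stem.
    """
    groups = defaultdict(set)
    for word in words:
        groups[stem(word)].add(word)
    return {key: sorted(forms) for key, forms in groups.items()}


# Usage (assumes nltk is installed):
#   from nltk.stem import SnowballStemmer
#   group_by_stem(['dance', 'danced', 'dancer'], SnowballStemmer('english').stem)
```

Passing the stemmer in as an argument keeps the helper independent of nltk, so you can swap in another stemmer later.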
Below is an example of using stems. I'm building a dictionary of stem: (word, count)
while choosing the shortest word possible for each stem. So ['dancing', 'danced', 'dances', 'dance', 'dancer'] converts to {'danc': ('dance', 4), 'dancer': ('dancer', 1)}
Example code: (text taken from http://en.wikipedia.org/wiki/Dance)
import re
from nltk.stem import SnowballStemmer as SS

text = """Dancing has evolved many styles. African dance is interpretative.
Ballet, ballroom (such as the waltz), and tango are classical styles of dance
while square dancing and the electric slide are forms of step dances.
More recently evolved are breakdancing and other forms of street dance,
often associated with hip hop culture.
Every dance, no matter what style, has something in common.
It not only involves flexibility and body movement, but also physics.
If the proper physics are not taken into consideration, injuries may occur."""

# extract words
words = [word.lower() for word in re.findall(r'\w+', text)]
stemmer = SS('english')
counts = dict()

# count stems and extract shortest words possible
for word in words:
    stem = stemmer.stem(word)
    if stem in counts:
        shortest, count = counts[stem]
        if len(word) < len(shortest):
            shortest = word
        counts[stem] = (shortest, count + 1)
    else:
        counts[stem] = (word, 1)

# convert {key: (word, count)} to [(word, count, key)] for convenient sort and print
output = [wordcount + (root,) for root, wordcount in counts.items()]

# trick to sort output by count (descending) & word (alphabetically)
output.sort(key=lambda x: (-x[1], x[0]))
for item in output:
    print('%s:%d (Root: %s)' % item)
Output:
dance:7 (Root: danc)
and:4 (Root: and)
are:4 (Root: are)
of:3 (Root: of)
style:3 (Root: style)
the:3 (Root: the)
evolved:2 (Root: evolv)
forms:2 (Root: form)
has:2 (Root: has)
not:2 (Root: not)
physics:2 (Root: physic)
african:1 (Root: african)
also:1 (Root: also)
as:1 (Root: as)
associated:1 (Root: associ)
ballet:1 (Root: ballet)
ballroom:1 (Root: ballroom)
body:1 (Root: bodi)
breakdancing:1 (Root: breakdanc)
---truncated---
I would not recommend lemmatization for your specific needs:
>>> from nltk.stem.wordnet import WordNetLemmatizer
>>> lmtzr = WordNetLemmatizer()
>>> lmtzr.lemmatize('dance')
'dance'
>>> lmtzr.lemmatize('dancer')
'dancer'
>>> lmtzr.lemmatize('dancing')
'dancing'
>>> lmtzr.lemmatize('dances')
'dance'
>>> lmtzr.lemmatize('danced')
'danced'
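Substrings are not a good idea because they will always fail at some point, and often fail miserably.
- Fixed length: the pseudo-words 'dancitization' and 'dancendence' will match 'dance' at 4 and 5 characters respectively.
- Ratio: a low ratio will return fakes (like the above)
- Ratio: a high ratio will not match enough (e.g. 'running')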
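To make these failure modes concrete, here is a minimal sketch of prefix-based matching in plain Python. The helper names and the definition of the ratio (shared prefix length divided by the length of the longer word) are my own assumptions for illustration, not anything nltk provides.

```python
def shared_prefix_len(a, b):
    """Length of the common prefix of two strings."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def fixed_length_match(word, target, length):
    """Match if the two words share at least `length` leading characters."""
    return shared_prefix_len(word, target) >= length


def ratio_match(word, target, ratio):
    """Match if the shared prefix covers at least `ratio` of the longer word."""
    longest = max(len(word), len(target))
    return shared_prefix_len(word, target) / float(longest) >= ratio


# Fixed length: both pseudo-words are accepted as relatives of 'dance'.
print(fixed_length_match('dancitization', 'dance', 4))  # shares 'danc'
print(fixed_length_match('dancendence', 'dance', 5))    # shares 'dance'

# Low ratio (0.3): the fakes still get through (4/13 and 5/11).
print(ratio_match('dancitization', 'dance', 0.3))
print(ratio_match('dancendence', 'dance', 0.3))

# High ratio (0.5): a genuine pair like 'running' / 'run' is rejected (3/7).
print(ratio_match('running', 'run', 0.5))
```

Whatever threshold you pick, one of the two groups ends up on the wrong side of it.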
But with stemming you get:
>>> stemmer.stem('dancitization')
u'dancit'
>>> stemmer.stem('dancendence')
u'dancend'
>>> #since dancitization gives us dancit, let's try dancization to get danc
>>> stemmer.stem('dancization')
u'dancize'
>>> stemmer.stem('dancation')
u'dancat'
That is an impressive no-match result for the stem 'danc'. Even considering that 'dancer' does not stem to 'danc', the accuracy is pretty high overall.
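If you want to use this to pick out the relatives of one word from a list, something along these lines might do. Again a sketch only: `words_sharing_stem` is a hypothetical helper, and `stem` is assumed to be a callable such as `SnowballStemmer('english').stem`.

```python
def words_sharing_stem(target, candidates, stem):
    """Return the candidates whose stem equals the stem of `target`.

    `stem` is assumed to be a callable str -> str (hypothetical parameter).
    """
    target_stem = stem(target)
    return [word for word in candidates if stem(word) == target_stem]


# Usage (assumes nltk is installed):
#   from nltk.stem import SnowballStemmer
#   words_sharing_stem('dance', ['danced', 'dancitization'],
#                      SnowballStemmer('english').stem)
```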
I hope this helps you get started.