I am using the Snowball stemmer to stem the words in a document, as in the snippet below.
stemmer = EnglishStemmer()
# Stem, lowercase, strip all punctuation, remove stopwords.
attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
When I run this over the documents with PyDev in Eclipse, I get no errors. When I run it from the terminal (Mac OS X), I get the error below. Can anyone help?
File "data_processing.py", line 171, in __filter__
attribute_names = [stemmer.stem(token.lower()) for token in wordpunct_tokenize(re.sub('[%s]' % re.escape(string.punctuation), '', doc)) if token.lower() not in stopwords.words('english')]
File "7.3/lib/python2.7/site-packages/nltk-2.0.4-py2.7.egg/nltk/stem/snowball.py", line 694, in stem
word = (word.replace(u"\u2019", u"\x27")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc2 in position 7: ordinal not in range(128)
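From what I can tell, 0xc2 is the lead byte of a two-byte UTF-8 sequence, so doc presumably contains non-ASCII characters and is reaching the stemmer as a plain byte string; snowball.py then triggers an implicit ASCII decode at word.replace(u"\u2019", u"\x27") and fails (I'm guessing PyDev just happens to run with a different default encoding, which is why it only breaks in the terminal). Below is a minimal sketch of the decoding step I think might be missing, assuming the documents are UTF-8 encoded files; the codecs.open() call and the file name are placeholders, not my actual code:

# -*- coding: utf-8 -*-
import codecs
import re
import string

from nltk.corpus import stopwords
from nltk.stem.snowball import EnglishStemmer
from nltk.tokenize import wordpunct_tokenize

stemmer = EnglishStemmer()
stop = set(stopwords.words('english'))

# Hypothetical: read the document as unicode up front instead of a byte
# string, assuming the file is UTF-8 encoded.
with codecs.open('some_document.txt', 'r', encoding='utf-8') as f:
    doc = f.read()  # doc is now a unicode object, not str

# Same pipeline as above, but every token is already unicode, so the
# stemmer's internal word.replace(u"\u2019", u"\x27") no longer forces
# an implicit ASCII decode.
cleaned = re.sub(u'[%s]' % re.escape(string.punctuation), u'', doc)
attribute_names = [stemmer.stem(token.lower())
                   for token in wordpunct_tokenize(cleaned)
                   if token.lower() not in stop]

Is decoding to unicode like this the right fix, or is there a better way to handle the encoding here?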