I'm trying to write a document-cleaning function that removes stopwords, applies POS tagging, and stems the words. Here is my code:

import re
import nltk
from nltk.corpus import stopwords

def cleanDoc(doc):
    stopset = set(stopwords.words('english'))
    stemmer = nltk.PorterStemmer()
    # Remove punctuation, convert to lower case and split into separate words
    tokens = re.findall(r"<a.*?/a>|<[^\>]*>|[\w'@#]+", doc.lower(), flags=re.UNICODE | re.LOCALE)
    # Remove stopwords and words of 2 characters or fewer
    clean = [token for token in tokens if token not in stopset and len(token) > 2]
    #POS Tagging
    pos = nltk.pos_tag(clean)
    #Stemming
    final = [stemmer.stem(word) for word in pos]
    return final

I get this error:

Traceback (most recent call last):
  File "C:\Users\USer\Desktop\tutorial\main.py", line 38, in <module>
    final = cleanDoc(doc)
  File "C:\Users\USer\Desktop\tutorial\main.py", line 30, in cleanDoc
    final = [stemmer.stem(word) for word in pos]
  File "C:\Python27\lib\site-packages\nltk\stem\porter.py", line 556, in stem
    stem = self.stem_word(word.lower(), 0, len(word) - 1)
AttributeError: 'tuple' object has no attribute 'lower'

2 Answers

On this line:

pos = nltk.pos_tag(clean)

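nltk.pos_tag() returns a list of (word, tag) tuples, not a list of strings, so each item you pass to stemmer.stem() is a tuple and has no lower() method. Take the first element of each tuple to get the word: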

pos = nltk.pos_tag(clean)
final = [stemmer.stem(tagged_word[0]) for tagged_word in pos]
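
To see why the original loop fails without installing anything, here is a minimal sketch. The `tagged` list is hand-written to mimic the shape of what nltk.pos_tag() returns; the tags are illustrative, not real tagger output.

```python
# Illustrative data in the shape nltk.pos_tag() returns: a list of (word, tag) tuples.
# The tags are hand-written for this example, not produced by a real tagger.
tagged = [("dogs", "NNS"), ("running", "VBG"), ("quickly", "RB")]

# Calling a string method on the whole tuple fails, which is what happens inside stem().
try:
    tagged[0].lower()
    error = None
except AttributeError as exc:
    error = str(exc)

# Index 0 of each tuple is the word; these strings are safe to pass to stemmer.stem().
words = [tagged_word[0] for tagged_word in tagged]

print(error)
print(words)
```

The captured message should match the last line of the traceback in the question, and `words` holds plain strings again.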
answered 2013-04-17T13:32:08.303

nltk.pos_tag returns a list of tuples, not a list of strings. Maybe you want:

final = [stemmer.stem(word) for word, _ in pos]
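
Note that the line above throws the tags away. If you tagged the words because you want to keep the tags (an assumption on my part), the same unpacking can carry them along. A rough sketch: `stem_tagged` and `toy_stem` are hypothetical names for this example, and `toy_stem` is only a stand-in so the snippet runs without NLTK.

```python
def stem_tagged(tagged, stem):
    """Return (stem, tag) pairs; `stem` is any callable mapping a word to its stem."""
    return [(stem(word), tag) for word, tag in tagged]


def toy_stem(word):
    # Hypothetical stand-in for nltk.PorterStemmer().stem: strips one trailing "s".
    return word[:-1] if word.endswith("s") else word


# Hand-written (word, tag) pairs in the shape nltk.pos_tag() returns.
tagged = [("dogs", "NNS"), ("cats", "NNS"), ("run", "VB")]
result = stem_tagged(tagged, toy_stem)
print(result)
```

In your function you would call it as stem_tagged(pos, stemmer.stem) instead of using the toy stemmer.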
answered 2013-04-17T13:31:12.447