python - Is there a better preprocessing library or implementation in Python?

Question

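I need to preprocess some text documents so that I can apply classification techniques such as fcm and topic modelling techniques such as latent Dirichlet allocation.

To elaborate on the preprocessing: I need to remove stop words, extract nouns and keywords, and perform stemming. The code I use for this is: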

import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer

# `a` is the list of raw document strings. `stoplist` and `porter` were not
# defined in the original snippet; the definitions below are assumptions.
stoplist = set(stopwords.words('english'))
porter = PorterStemmer()

#--------------------------------------------------------------------------
#Extracting nouns
#--------------------------------------------------------------------------
documents = []
for x in a:
    temp = ''  # start a fresh noun string for each document
    text = nltk.pos_tag(nltk.word_tokenize(x))
    for noun in text:
        # keep singular (NN) and plural (NNS) common nouns only
        if noun[1] == "NN" or noun[1] == "NNS":
            temp += noun[0]
            temp += ' '
    documents.append(temp)  # one entry per input document
print(documents)

#--------------------------------------------------------------------------
#remove unnecessary words and tags
#--------------------------------------------------------------------------

texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]
allTokens = sum(texts, [])
# words that occur exactly once in the whole corpus
tokensOnce = set(word for word in set(allTokens) if allTokens.count(word) == 1)
texts = [[word for word in text if word not in tokensOnce] for text in texts]
print(texts)

#--------------------------------------------------------------------------
#Stemming
#--------------------------------------------------------------------------

for i in texts:
    for j in range(0, len(i)):
        k = porter.stem(i[j])  # replace each token with its Porter stem
        i[j] = k
print(texts)
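
The rare-word step calls allTokens.count(word) once per distinct word, which rescans the whole token list every time. A minimal sketch of the same filter using collections.Counter is shown here; the helper name drop_singletons and the sample data are hypothetical and only illustrate the idea.

```python
from collections import Counter


def drop_singletons(texts):
    """Remove tokens that occur exactly once across all documents."""
    # Count every token in a single pass over the corpus.
    counts = Counter(token for text in texts for token in text)
    # Keep only tokens seen more than once in the whole corpus.
    return [[token for token in text if counts[token] > 1] for text in texts]


# Hypothetical sample: two tokenised documents.
sample = [["network", "graph"], ["network", "tree"]]
print(drop_singletons(sample))
```

The result has the same shape as texts in the code above, so it could be used in place of the tokensOnce lines.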

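The problems with the code above are:

The nltk module used for extracting nouns and keywords misses many words. For example, after preprocessing some documents, names such as "Sachin" were not recognised as keywords and were dropped.
The stemming is not right. There is too much stemming (network and networking to net), and sometimes nouns are stemmed as well.

Is there a better module for this functionality, or a better implementation using the same module? Please help.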
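
One possible cause of the first problem is that the noun filter only accepts the tags NN and NNS, while the Penn Treebank tag set marks proper nouns as NNP and NNPS. The sketch below keeps both groups and leaves proper nouns unstemmed. It works on already tagged (word, tag) pairs; the function name extract_nouns and the hand-tagged sample are hypothetical, and whether a tagger actually labels a given name as NNP is an assumption that depends on the tagger.

```python
# Penn Treebank noun tags: common (NN, NNS) and proper (NNP, NNPS).
COMMON_NOUN_TAGS = {"NN", "NNS"}
PROPER_NOUN_TAGS = {"NNP", "NNPS"}


def extract_nouns(tagged_tokens, stem=None):
    """Return the nouns from (word, tag) pairs.

    `stem` is an optional callable (for example PorterStemmer().stem);
    it is applied to common nouns only, so names stay intact.
    """
    nouns = []
    for word, tag in tagged_tokens:
        if tag in PROPER_NOUN_TAGS:
            nouns.append(word)  # keep names verbatim
        elif tag in COMMON_NOUN_TAGS:
            nouns.append(stem(word) if stem else word)
    return nouns


# Hand-tagged sample; in practice the pairs would come from nltk.pos_tag.
tagged = [("Sachin", "NNP"), ("scored", "VBD"), ("a", "DT"), ("century", "NN")]
print(extract_nouns(tagged))
```

Passing a stemmer only for common nouns is one way to limit over-stemming of names; it does not change how aggressively the stemmer treats ordinary words.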

score 2 · Accepted Answer


Try Pattern, I really like it: http://www.clips.ua.ac.be/pages/pattern

answered 2012-04-26T10:59:42.267
