python - Removing stop words using Python

Question

All,

I have some text that needs cleaning up, and I have a small algorithm that "mostly" works.

def removeStopwords(self, data):
    with open(r'stopwords.txt') as stopwords:
        wordList = []
        for i in stopwords:
            wordList.append(i.strip())
        charList = list(data)
        cat = ''.join(char for char in charList if not char in wordList).split()
        return ' '.join(cat)

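Take the first line of this page, http://en.wikipedia.org/wiki/Paragraph, and remove all the characters we are not interested in, which in this case are all the non-alphanumeric characters.

A paragraph (from the Greek paragraphos, "to write beside" or "written beside") is a self-contained unit of a discourse in writing dealing with a particular point or idea. A paragraph consists of one or more sentences.[1][2] The start of a paragraph is indicated by beginning on a new line. Sometimes the first line is indented. At various times, the beginning of a paragraph has been indicated by the pilcrow: ¶.

The output looks pretty good, except that some words are joined back together incorrectly, and I am not sure how to correct that.

A paragraph from the Greek paragraphos to write beside or written beside is a selfcontained unit

Note the word "selfcontained", which should be "self-contained".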

Edit: The stopwords file contains nothing but a bunch of characters.

! $ % ^ , & * ( ) { } [ ] <

, . / | \ ? ~ ` : ; "

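It turns out I do not need a word list at all, because all I am really trying to do is remove characters, which in this case are punctuation marks.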

        cat = ''.join(data.translate(None, string.punctuation)).split()
        print ' '.join(cat).lower()
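For readers on Python 3, where `str.translate` takes a single table argument, the same cleanup might be sketched as below. The function name `strip_punctuation` and its `keep` parameter are hypothetical, introduced only to show how a character such as the hyphen could be preserved so that "self-contained" survives.

```python
import string

def strip_punctuation(text, keep=''):
    """Delete ASCII punctuation from text, except characters listed in keep.

    Hypothetical helper for illustration; not part of the original post.
    """
    # Characters to delete: all ASCII punctuation minus anything in keep.
    to_delete = ''.join(ch for ch in string.punctuation if ch not in keep)
    # The three-argument form of str.maketrans maps every character in the
    # third argument to None, which str.translate then drops.
    table = str.maketrans('', '', to_delete)
    # split() followed by join() collapses whitespace left by the deletions.
    return ' '.join(text.translate(table).split()).lower()

sample = 'A paragraph ("written beside") is a self-contained unit.'
print(strip_punctuation(sample))            # hyphen removed
print(strip_punctuation(sample, keep='-'))  # hyphen preserved
```

Note that `string.punctuation` covers ASCII punctuation only, so a character such as ¶ would pass through unchanged.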

score 2 · Accepted Answer


Python version 2.x:

line = 'hello!'
line.translate(None, '!$%') #'hello'
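The two-argument call above works only on Python 2 byte strings. On Python 3 the characters to delete are passed through `str.maketrans` instead; a minimal sketch of the equivalent, reusing the same example string:

```python
line = 'hello!'
# The third argument of str.maketrans lists the characters to delete.
table = str.maketrans('', '', '!$%')
cleaned = line.translate(table)
print(cleaned)  # hello
```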

Answered 2012-02-22T19:45:42.073

score 1 · Accepted Answer

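Load your stop words/stop characters in a separate function.

Don't hard-code the file name/path.

Your wordList should be a set, not a list.

However, if you are working with characters rather than words, look into str.translate.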
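The suggestions above could be combined roughly as follows. This is a sketch rather than the answerer's own code: the function names `load_stop_chars` and `remove_stop_chars` are hypothetical, and it assumes the stop file holds whitespace-separated single characters, as described in the question's edit.

```python
def load_stop_chars(path):
    """Read whitespace-separated stop characters from path into a set.

    Hypothetical helper; the path is a parameter rather than hard-coded.
    """
    with open(path, encoding='utf-8') as handle:
        # split() with no arguments handles spaces and newlines alike.
        return set(handle.read().split())

def remove_stop_chars(text, stop_chars):
    """Drop every character of text that appears in stop_chars."""
    # Set membership is O(1) on average, versus O(n) for a list.
    kept = ''.join(ch for ch in text if ch not in stop_chars)
    # Collapse whitespace left behind by the deletions.
    return ' '.join(kept.split())

# Usage with an in-memory set, so the example runs without a file.
print(remove_stop_chars('Hello, world! [1]', {',', '!', '[', ']'}))
```

With a real file, the set would come from something like `load_stop_chars('stopwords.txt')`, loaded once and reused across calls.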

score -2 · Accepted Answer

One approach is to use the replace method with an exhaustive list of the characters you do not want.

For example:

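c = ['a', 'h']  # characters to remove
a = 'john'
for item in c:
    # Strings are immutable, so rebind a to the result of replace().
    a = a.replace(item, '')
    print(a)  # shows the intermediate result after each removal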

This prints the following: john, then jon.
