python - Faster way to remove stop words in Python

Question

I am trying to remove stop words from a string of text:

from nltk.corpus import stopwords
text = 'hello bye the the hi'
text = ' '.join([word for word in text.split() if word not in (stopwords.words('english'))])

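I am processing 6 million such strings, so speed is important. Profiling my code, the slowest part is the line above. Is there a better way to do this? I'm thinking of using something like a regex with re.sub, but I don't know how to write the pattern for a set of words. Can someone give me a hand? I'm also happy to hear about other, possibly faster, methods.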

Note: I tried someone's suggestion of wrapping stopwords.words('english') with set(), but it made no difference.

Thank you.

score 103 · Accepted Answer

Try caching the stop words object, as shown below. Constructing it each time you call the function seems to be the bottleneck.

    from nltk.corpus import stopwords

    cachedStopWords = stopwords.words("english")

    def testFuncOld():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in stopwords.words("english")])

    def testFuncNew():
        text = 'hello bye the the hi'
        text = ' '.join([word for word in text.split() if word not in cachedStopWords])

    if __name__ == "__main__":
        for i in range(10000):
            testFuncOld()
            testFuncNew()

I ran this through the profiler: python -m cProfile -s cumulative test.py. The relevant lines are posted below.

    nCalls  Cumulative Time
    10000   7.723  words.py:7(testFuncOld)
    10000   0.140  words.py:11(testFuncNew)

So caching the stop words instance gives roughly a 55x speedup (7.723 s versus 0.140 s cumulative time).
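A rough way to see the same effect without NLTK installed is sketched below. `load_stop_words` is a hypothetical stand-in for `stopwords.words("english")`, and the word list is a small made-up sample, so the absolute timings will not match the profile above.

```python
import timeit

# Small made-up sample; the real NLTK list is longer.
RAW = "i me my we our you your he him his she her it its they them the a an and or but if of at by for with"

def load_stop_words():
    # Hypothetical stand-in for stopwords.words("english"):
    # rebuilds the list on every call.
    return RAW.split()

CACHED_STOP_WORDS = load_stop_words()  # built once, reused for every string

def remove_uncached(text):
    # The condition is evaluated per word, so the list is rebuilt per word.
    return ' '.join([w for w in text.split() if w not in load_stop_words()])

def remove_cached(text):
    return ' '.join([w for w in text.split() if w not in CACHED_STOP_WORDS])

text = 'hello bye the the hi'
print(remove_cached(text))  # hello bye hi
print("uncached:", timeit.timeit(lambda: remove_uncached(text), number=10000))
print("cached:  ", timeit.timeit(lambda: remove_cached(text), number=10000))
```

The gap with real NLTK data should be larger than with this stand-in, since each uncached call there goes back to the corpus reader.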

score 22 · Accepted Answer

Use a regular expression to remove every word that matches a stop word:

import re
from nltk.corpus import stopwords
pattern = re.compile(r'\b(' + r'|'.join(stopwords.words('english')) + r')\b\s*')
text = pattern.sub('', text)

This is probably much faster than looping yourself, especially for large input strings.

If the last word in the text gets removed this way, you may be left with trailing whitespace. I suggest handling that separately.
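One simple way to handle that trailing whitespace is to strip the result after the substitution. This is only a sketch: `STOP_WORDS` is a small made-up list standing in for `stopwords.words('english')`, and `remove_stop_words` is a name chosen for illustration.

```python
import re

# Made-up sample standing in for stopwords.words('english').
STOP_WORDS = ['the', 'a', 'an', 'of', 'and']

pattern = re.compile(r'\b(' + r'|'.join(STOP_WORDS) + r')\b\s*')

def remove_stop_words(text):
    # The pattern eats the whitespace *after* each stop word, so a stop word
    # at the end of the text leaves the space before it behind; strip() drops it.
    return pattern.sub('', text).strip()

print(repr(remove_stop_words('hello bye the the hi')))  # 'hello bye hi'
print(repr(remove_stop_words('hello the')))             # 'hello'
```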

score 19 · Accepted Answer

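Sorry for the late reply. This should prove useful for new users.

Create a dictionary of stop words using the collections library.

Use that dictionary for very fast lookups (time = O(1)) rather than searching the list (time = O(number of stop words)).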

from collections import Counter
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
stopwords_dict = Counter(stop_words)
text = ' '.join([word for word in text.split() if word not in stopwords_dict])
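The difference between the two lookup costs can be seen with a small timing sketch. The word list here is synthetic (`word0`, `word1`, ...), not the NLTK list, and the numbers will vary by machine.

```python
import timeit
from collections import Counter

# Synthetic vocabulary standing in for a stop word list.
words_list = ['word%d' % i for i in range(200)]
words_dict = Counter(words_list)  # hash-based lookup

# A word that is absent is the worst case for the list: every entry is scanned.
probe = 'not-a-stop-word'

list_time = timeit.timeit(lambda: probe in words_list, number=100000)
dict_time = timeit.timeit(lambda: probe in words_dict, number=100000)
print('list: %.4fs  Counter: %.4fs' % (list_time, dict_time))
```

Most words in ordinary text are not stop words, so the absent-word case is the common one.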

score 5 · Accepted Answer

First, you are creating the stop words for each string. Create them once. A set would indeed be great here.

forbidden_words = set(stopwords.words('english'))

Next, get rid of the [] inside join. Use a generator instead.

Replace

' '.join([x for x in ['a', 'b', 'c']])

with

' '.join(x for x in ['a', 'b', 'c'])
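Both forms produce the same string, so it is easy to measure which one is faster on a given interpreter before committing to either. In CPython, `str.join` builds a list from a generator internally, so the generator form is not guaranteed to win. The sketch below measures both on a made-up stop word set.

```python
import timeit

forbidden_words = {'the', 'a', 'an'}  # made-up sample set
text = 'hello bye the the hi'

def with_list():
    return ' '.join([w for w in text.split() if w not in forbidden_words])

def with_generator():
    return ' '.join(w for w in text.split() if w not in forbidden_words)

# Same result either way; only the timing differs.
assert with_list() == with_generator()
print('list comprehension:', timeit.timeit(with_list, number=100000))
print('generator:         ', timeit.timeit(with_generator, number=100000))
```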

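The next thing to tackle would be making .split() yield values instead of returning a list. ~~I believe regex would be a good replacement here.~~ See this thread for why s.split() is actually fast.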

Lastly, do this kind of job (removing stop words from 6 million strings) in parallel. That is a whole different topic.
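As a starting point for the parallel route, here is a minimal sketch using `multiprocessing.Pool` from the standard library. The names `strip_stop_words` and `strip_all`, the sample word set, and the default `processes` and `chunksize` values are all assumptions for illustration, not tuned settings.

```python
from multiprocessing import Pool

FORBIDDEN_WORDS = frozenset(['the', 'a', 'an'])  # made-up sample set

def strip_stop_words(text):
    # Runs in a worker process; it must be a top-level function
    # so that it can be pickled and sent to the workers.
    return ' '.join(w for w in text.split() if w not in FORBIDDEN_WORDS)

def strip_all(texts, processes=4, chunksize=10000):
    # chunksize batches the strings so that inter-process overhead is paid
    # once per batch rather than once per string.
    with Pool(processes=processes) as pool:
        return pool.map(strip_stop_words, texts, chunksize=chunksize)
```

Call `strip_all(list_of_strings)` from under an `if __name__ == "__main__":` guard, which `multiprocessing` needs on platforms that start workers in a fresh interpreter. Whether this beats a single process depends on how long the strings are, because every string has to be pickled to and from the workers.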

score 0 · Accepted Answer

Try avoiding the loop altogether and use a regex to remove the stop words instead:

import re
from nltk.corpus import stopwords

cachedStopWords = stopwords.words("english")
pattern = re.compile(r'\b(' + r'|'.join(cachedStopWords) + r')\b\s*')
text = pattern.sub('', text)
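Two details are worth guarding against when building the alternation from a word list: entries that contain regex metacharacters, and entries that are prefixes of other entries. The apostrophe in a word such as `don't` counts as a word boundary, so a shorter alternative `don` can match first and leave `'t` behind. Below is a sketch using `re.escape` and longest-first ordering, with a made-up word list.

```python
import re

# Made-up sample; note that 'don' is a prefix of "don't".
stop_words = ['don', "don't", 'the', 'a']

# Longest alternatives first, so "don't" is tried before 'don'.
ordered = sorted(stop_words, key=len, reverse=True)
pattern = re.compile(r'\b(?:' + '|'.join(re.escape(w) for w in ordered) + r')\b\s*')

text = "don't stop the music"
print(pattern.sub('', text).strip())  # stop music
```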

score 0 · Accepted Answer

Using just a regular dict seems to be the fastest solution by far.
It even beats the Counter solution by about 10%.

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'
text = " ".join([word for word in text.split() if word not in stopwords_dict])

Tested using the cProfile profiler.

You can find the test code used here: https://gist.github.com/maxandron/3c276924242e7d29d9cf980da0a8a682
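The gist linked above is the actual benchmark. For readers who only want the general shape of such a test, here is a sketch that profiles dict, `Counter`, and set lookups with `cProfile`. The word list and function names are made up for illustration, and the output will differ between machines and Python versions.

```python
import cProfile
import pstats
from collections import Counter

sample = ['the', 'a', 'an', 'of', 'and', 'to', 'in']  # made-up stop words
as_dict = {word: 1 for word in sample}
as_counter = Counter(sample)
as_set = set(sample)
text = 'hello bye the the hi'

def clean(lookup):
    return ' '.join([w for w in text.split() if w not in lookup])

def use_dict():
    return clean(as_dict)

def use_counter():
    return clean(as_counter)

def use_set():
    return clean(as_set)

profiler = cProfile.Profile()
profiler.enable()
for _ in range(10000):
    use_dict()
    use_counter()
    use_set()
profiler.disable()

# Sort by cumulative time, like `-s cumulative` on the command line,
# and restrict the report to the three use_* functions.
pstats.Stats(profiler).sort_stats('cumulative').print_stats('use_')
```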

Edit:

On top of that, if we replace the list comprehension with a loop, we get another 20% increase in performance.

from nltk.corpus import stopwords
stopwords_dict = {word: 1 for word in stopwords.words("english")}
text = 'hello bye the the hi'

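new = ""
for word in text.split():
    if word not in stopwords_dict:
        new += word + " "  # keep a separator so the kept words are not fused together
text = new.rstrip()  # drop the trailing space left by the last word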
