python - Implementing stopwords in Python

Question

Which is faster for a stopword list in Python, defining it inline:

 stopwords = ('a','and', 'etc')

or loading it from a file?

score 2 · Accepted Answer

NLTK provides stopwords as a list.

import nltk  # the corpus must be fetched once: nltk.download('stopwords')
nltk.corpus.stopwords.words('english')

If that is what you mean, it is faster than keeping the stopwords in a file and reading from it every time you iterate over them.
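As a rough sketch of that idea, the list can be loaded once and kept in memory for every later lookup. The helper name `load_english_stopwords` and the tiny fallback set are assumptions made for illustration, not part of NLTK; the fallback is used only when NLTK or its stopwords corpus is not installed.

```python
def load_english_stopwords():
    """Return English stopwords as a frozenset, built a single time."""
    try:
        from nltk.corpus import stopwords
        # Raises LookupError if the corpus was never downloaded
        return frozenset(stopwords.words('english'))
    except (ImportError, LookupError):
        # Assumption: a tiny illustrative fallback, not a real stopword list
        return frozenset({'a', 'and', 'the'})


# Build the collection once, then reuse it for every membership test
STOPWORDS = load_english_stopwords()

words = ['a', 'parser', 'and', 'lexer']
print([w for w in words if w not in STOPWORDS])
```

Converting to a `frozenset` is a design choice on top of the answer: the NLTK call itself returns a plain list.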

score 1 · Accepted Answer

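If you don't want to download NLTK, stopword files are easy to find elsewhere. They usually list one word per line, so it is easy to load them into your own structure.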

# Read one stopword per line; strip() drops the trailing newline
with open('stopwordfile') as f:
    stopwords = tuple(line.strip() for line in f if line.strip())

However, looking a word up in a dictionary is faster than looking it up in a tuple, possibly best with a default return value:

# Dict keys are hashed, so membership tests do not scan every entry
stopdict = {w: True for w in stopwords}

for word in text_you_want_to_index:
    # Alternative: `if not stopdict.get(word, False):`; unclear which is faster
    if word not in stopdict:
        print(word)
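Which lookup is faster can be measured rather than guessed. The following is a minimal sketch using `timeit`; the generated stopword names, the sample text, and the helper `filter_with` are all made up for the measurement. A `frozenset` is included as a third option because it is hashed the same way dict keys are. Timings depend on the machine, so none are quoted here.

```python
import timeit

# Hypothetical data: 200 fake stopwords and a 1000-word text
stopwords = tuple('w%d' % i for i in range(200))
stopdict = {w: True for w in stopwords}
stopset = frozenset(stopwords)
text = ['w199', 'python', 'w0', 'index'] * 250


def filter_with(container):
    """Keep the words of `text` that are not in `container`."""
    return [word for word in text if word not in container]


for name, container in [('tuple', stopwords), ('dict', stopdict), ('set', stopset)]:
    seconds = timeit.timeit(lambda: filter_with(container), number=20)
    print('%-5s %.4f s' % (name, seconds))
```

All three containers give the same filtered result; only the cost of each `in` test differs, since the tuple is scanned linearly while the dict and set use hashing.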

score 1 · Accepted Answer

File operations are much slower than ordinary in-memory code execution. So if the data you need is small enough, don't use a file.

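You would use a file if any of the following is true:

The input data needs to change without changing the actual code
There is a large amount of data to process
The data is supplied by another process/application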
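When one of these cases does apply, the file cost can be paid only once by reading the file at startup and keeping the words in memory. This is a minimal sketch: the function name `load_stopwords` and the temporary file with three words are assumptions used only to make the example self-contained.

```python
import os
import tempfile


def load_stopwords(path):
    """Read one stopword per line and return them as a frozenset."""
    with open(path, encoding='utf-8') as f:
        return frozenset(line.strip() for line in f if line.strip())


# Demo only: write a small hypothetical stopword file to a temporary location
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False,
                                 encoding='utf-8') as tmp:
    tmp.write('a\nand\netc\n')
    path = tmp.name

STOPWORDS = load_stopwords(path)  # the file is read exactly once
os.remove(path)

words = 'a cat and a dog'.split()
kept = [w for w in words if w not in STOPWORDS]
print(kept)  # ['cat', 'dog']
```

After the load, every lookup runs against the in-memory set, so editing the stopword file changes behavior on the next start without touching the code.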

If you only have a limited number of stopwords and don't need to change them often, always use

stopwords = ('a','and', 'etc')
