python - Counting word frequency across multiple files

Question

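I am writing code to count how often words occur in a document collection of about 20,000 files. I can already get the overall frequency of each word across the whole collection; my code so far is: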

import os
import re
import sys
sys.stdout=open('f2.txt','w')
from collections import Counter
from glob import iglob

def removegarbage(text):
    text=re.sub(r'\W+',' ',text)
    text=text.lower()
    return text

folderpath='d:/articles-words'
counter=Counter()

for filepath in iglob(os.path.join(folderpath,'*.txt')):
    with open(filepath,'r') as filehandle:
        counter.update(removegarbage(filehandle.read()).split())

for word,count in counter.most_common():
    print('{}  {}'.format(word,count))

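However, I want to change my counter so that it is updated only once per file, i.e. each file should contribute either 0 or 1 to a word's count, depending on whether the word is absent from or present in that file. For example, the word "little" occurs 3 times in file1 and 8 times in file45, so its count should be 2, not 11, but my current code reports 11.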

score 4 · Accepted Answer

Use sets:

for filepath in iglob(os.path.join(folderpath,'*.txt')):
    with open(filepath,'r') as filehandle:
        words = set(removegarbage(filehandle.read()).split()) 
        counter.update(words)

A set contains only unique values:

>>> strs = "foo bat foo"
>>> set(strs.split())
set(['bat', 'foo'])

Example using collections.Counter:

>>> c = Counter()
>>> strs = "foo bat foo"
>>> c.update(set(strs.split()))
>>> strs = "foo spam foo"
>>> c.update(set(strs.split()))
>>> c
Counter({'foo': 2, 'bat': 1, 'spam': 1})
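
To check the idea without touching the 20,000 files, the same logic can be tried on a few in-memory strings. This is only a minimal sketch: the function name document_frequency and the sample list docs are made up for illustration, and the normalisation simply mirrors removegarbage() from the question.

```python
import re
from collections import Counter


def document_frequency(texts):
    """Count, for each word, how many of the given texts contain it.

    `document_frequency` is a hypothetical helper, not part of the
    original code; `texts` is any iterable of strings.
    """
    counter = Counter()
    for text in texts:
        # Same normalisation as removegarbage() in the question:
        # replace runs of non-word characters with a space, then lowercase.
        words = re.sub(r'\W+', ' ', text).lower().split()
        # set() drops repeats, so each text adds at most 1 per word.
        counter.update(set(words))
    return counter


# Illustrative stand-ins for file contents (assumed sample data).
docs = ["Little by little", "a little more", "nothing here"]
df = document_frequency(docs)
print(df['little'])  # 2: present in two of the three texts
```

For the real files, the strings would come from filehandle.read() inside the iglob loop, exactly as in the loop above.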
