python - Counting word frequencies across multiple documents in Python

Question

I have a list of paths to multiple text files stored in a dictionary 'd':

'd:/individual-articles/9.txt', 'd:/individual-articles/11.txt', 'd:/individual-articles/12.txt',...

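and so on...

Now I need to read every file in the dictionary and keep a count of occurrences for every word that appears across all of those files.

My output should be of the following form: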

the-500

a-78

in-56

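and so on.

where 500 is the number of times the word "the" occurs across all the files in the dictionary, and so on.

I need to do this for all the words.

I am a Python newbie. Please help!

My code below doesn't work; it shows no output! There must be a mistake in my logic, please correct it!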

import collections
import itertools
import os
from glob import glob
from collections import Counter




folderpaths='d:/individual-articles'
counter=Counter()


filepaths = glob(os.path.join(folderpaths,'*.txt'))




folderpath='d:/individual-articles/'
# i am creating my dictionary here, can be ignored
d = collections.defaultdict(list)
with open('topics.txt') as f:
    for line in f:
        value, *keys = line.strip().split('~')
        for key in filter(None, keys):
            if key=='earn':
                d[key].append(folderpath+value+".txt")

    for key, value in d.items() :
        print(value)


word_count_dict={}

for file in d.values():
    with open(file,"r") as f:
        words = re.findall(r'\w+', f.read().lower())
        counter = counter + Counter(words)
        for word in words:
            word_count_dict[word].append(counter)              


for word, counts in word_count_dict.values():
    print(word, counts)

score 1 · Accepted Answer

Inspired by the Counter collection you are already using:

import os
import re
from glob import glob
from collections import Counter

folderpaths = 'd:/individual-articles'
counter = Counter()

# Collect every .txt file in the folder
filepaths = glob(os.path.join(folderpaths, '*.txt'))
for file in filepaths:
    with open(file) as f:
        # Lowercase the text and split it into words
        words = re.findall(r'\w+', f.read().lower())
        # Add this file's counts to the running total
        counter = counter + Counter(words)
print(counter)

score 0 · Accepted Answer

Your code should give you an error on this line:

word_count_dict[word][file]+= 1

Because your word_count_dict is empty, doing word_count_dict[word][file] should raise a KeyError: word_count_dict[word] does not exist, so you cannot index it with [file].
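To illustrate the KeyError described above, here is a minimal sketch. The sample words and file names are made up for the demonstration, and the nested defaultdict is just one possible way around the problem, not necessarily what the asker intended.

```python
from collections import defaultdict

# A plain, empty dict raises KeyError as soon as the outer key is looked up.
plain = {}
try:
    plain["the"]["9.txt"] += 1
except KeyError as exc:
    missing_key = exc.args[0]  # the key that could not be found

# One possible fix (an assumption, not the asker's code): a nested
# defaultdict creates the inner dict and the zero count on demand.
word_count_dict = defaultdict(lambda: defaultdict(int))

# Hypothetical (word, file) observations, for illustration only.
sample = [("the", "9.txt"), ("the", "9.txt"), ("the", "11.txt"), ("a", "9.txt")]
for word, file in sample:
    word_count_dict[word][file] += 1

# Total occurrences of a word across all files.
total_the = sum(word_count_dict["the"].values())
```

If per-file counts are not needed, a single Counter keyed by word is simpler.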

I found another error:

while file in d.items():

This makes file a tuple. But then you do f = open(file,"r"), so you are assuming file is a string. That will also raise an error.
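A small sketch of the tuple problem, assuming the loop was meant to be a for loop over d.items() (a while statement would only test membership, not iterate). The dictionary contents below are hypothetical and reuse the paths from the question.

```python
from collections import defaultdict
from itertools import chain

# Hypothetical dictionary shaped like the asker's: topic -> list of paths.
d = defaultdict(list)
d["earn"].append("d:/individual-articles/9.txt")
d["earn"].append("d:/individual-articles/11.txt")

# Iterating over d.items() yields (key, value) tuples, not path strings.
first_item = next(iter(d.items()))

# Passing such a tuple to open() fails before any file is touched.
try:
    open(first_item)
    open_failed = False
except TypeError:
    open_failed = True

# d.values() yields lists of paths, so flatten them to get one path per step.
all_paths = list(chain.from_iterable(d.values()))
```

Each element of all_paths is a plain string, which is what open() expects.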

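This means that none of these lines are being executed. That in turn means that either d.items() in while file in d.items(): is empty, or filepaths in file in filepaths: is empty.

Honestly, I don't understand why you have both of them. I don't understand what you are trying to achieve there. You have already generated a list of filenames to parse. You should iterate over them. I also don't know why d is a dictionary. All you need is a list of all the files. You don't need to keep track of which key in the topics list a file came from, do you?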
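The suggestion to iterate over a flat list of files could look roughly like this. The helper names count_words and format_counts are made up for this sketch, and the utf-8 encoding is an assumption about the asker's files.

```python
import re
from collections import Counter


def count_words(filepaths):
    """Return a Counter of lowercase word frequencies across all files."""
    counter = Counter()
    for path in filepaths:
        # Assumption: the files are utf-8 encoded text.
        with open(path, encoding="utf-8") as f:
            # update() adds this file's words to the running totals.
            counter.update(re.findall(r"\w+", f.read().lower()))
    return counter


def format_counts(counter):
    """Return 'word-count' strings, most frequent word first."""
    return [f"{word}-{count}" for word, count in counter.most_common()]


# Usage (paths taken from the question, adjust as needed):
#     from glob import glob
#     filepaths = glob("d:/individual-articles/*.txt")
#     for line in format_counts(count_words(filepaths)):
#         print(line)
```

This prints one line per word in the the-500 style the question asks for, with a single total per word across all files.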
