python - How do I find word frequencies across multiple files without counting duplicates?

Question

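I am trying to find the frequency of words across multiple files in a folder. If a word is found in a file, I need to increment its count by 1. For example: if "all's well that ends well" is read in file 1, the count of "well" must be incremented by 1, not 2, and if "she is not well" is read in file 2, the count of "well" becomes 2.

I need to increment the counter without including duplicates, but my program does not take that into account, so please help!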

import os
import re
import sys
sys.stdout=open('f1.txt','w')
from collections import Counter
from glob import glob

def removegarbage(text):
    text=re.sub(r'\W+',' ',text)
    text=text.lower()
    sorted(text)
    return text

def removeduplicates(l):
    return list(set(l))


folderpath='d:/articles-words'
counter=Counter()


filepaths = glob(os.path.join(folderpath,'*.txt'))

num_files = len(filepaths)

# Add all words to counter
for filepath in filepaths:
    with open(filepath,'r') as filehandle:
        lines = filehandle.read()
        words = removegarbage(lines).split()
        cwords=removeduplicates(words)
        counter.update(cwords)

# Display most common
for word, count in counter.most_common():

    # Break out if the frequency is less than 0.1 * the number of files
    if count < 0.1*num_files:
        break
    print('{}  {}'.format(word,count))

I have already tried sorting and removing duplicates, but it still doesn't work!

score 0 · Accepted Answer

I would do it quite differently, but the crux is to use a set.

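from collections import Counter

frequency = Counter()
with open("file", "r") as f:
    for line in f:
        # Split the line into words; set() keeps each word once per line.
        for word in set(line.split()):
            frequency[word] += 1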

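I'm not sure whether this is preferable to .readline(); I usually use for loops because they are so simple.

Edit: I see what you're doing wrong. You use .read() to read the entire contents of the file (and run removegarbage() on it), and then .split() the result. That gives you a single flat list, destroying the newlines: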

>>> "Hello world!\nFoo bar!".split()
['Hello', 'world!', 'Foo', 'bar!']
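
To make the difference concrete, here is a minimal sketch (not the asker's code) that counts each word at most once per line and, separately, at most once per whole text. The function names `count_once_per_line` and `count_once_per_text` and the sample strings are made up for illustration; each string stands in for the contents of one file.

```python
from collections import Counter


def count_once_per_line(texts):
    """Count each word at most once per line."""
    counts = Counter()
    for text in texts:
        for line in text.splitlines():
            counts.update(set(line.lower().split()))
    return counts


def count_once_per_text(texts):
    """Count each word at most once per text (one text stands in for one file)."""
    counts = Counter()
    for text in texts:
        counts.update(set(text.lower().split()))
    return counts


# Hypothetical sample data: two "files", the first with two lines.
texts = ["all is well\nends well", "she is not well"]
per_line = count_once_per_line(texts)
per_text = count_once_per_text(texts)
print(per_line["well"])  # 3: once for each of the three lines
print(per_text["well"])  # 2: once for each of the two texts
```

Which of the two is wanted depends on whether a "duplicate" means a repeat within a line or a repeat within a file; the question as worded asks for the per-file count.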

score 0 · Accepted Answer

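If I understand your question correctly, you basically want to know, for each word, how many of the files it appears in (regardless of whether the same word occurs more than once in the same file). To do that, I wrote the following scheme, which simulates a list of many files. I only cared about the process, not the files themselves, so you will have to replace "files" with the actual list you want to process.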

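# `files` is assumed to be an iterable of files, each an iterable of lines.
d = {}
for i, f in enumerate(files, start=1):
    for line in f:
        for word in line.split():
            # Mark that `word` occurs in file number i; repeats within
            # the same file overwrite the same key, so they count once.
            d.setdefault(word, {})[i] = 1

# Number of distinct files each word appears in.
d2 = {}
for word, occurrences in d.items():  # .iteritems() exists only in Python 2
    d2[word] = sum(occurrences.values())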

The result will give you something like this: {'ends': 1, 'that': 1, 'is': 1, 'well': 2, 'she': 1, 'not': 1, "all's": 1}
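
For reading real files from a folder, the same idea can be sketched with a Counter and one set per file. This is only an illustration under stated assumptions: the helper name `files_containing_each_word`, the `*.txt` default pattern, and UTF-8 encoding are choices made here, not something either answer specifies. Words are split on whitespace only, so punctuation stays attached.

```python
from collections import Counter
from pathlib import Path


def files_containing_each_word(folder, pattern="*.txt"):
    """Return a Counter mapping each word to the number of files it appears in."""
    counts = Counter()
    for path in sorted(Path(folder).glob(pattern)):
        # Assumption: the files are UTF-8 text; adjust the encoding if not.
        text = path.read_text(encoding="utf-8").lower()
        # set() collapses repeats, so each file adds at most 1 per word.
        counts.update(set(text.split()))
    return counts
```

With the folder from the question this would be called as `files_containing_each_word("d:/articles-words")`, and `.most_common()` on the result lists the words found in the most files first.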
