
I have about 20,000 text files, numbered 5.txt, 10.txt, and so on.

I store the file paths of these files in a list I created, "list2".

I also have a text file, "temp.txt", containing 500 words:

vs
mln
money

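and so on.

I store these words in another list I created, "list".

Now I create a nested dictionary, d2[file][word] = frequency count of "word" in "file".

Now,

I need to iterate over these words for each text file.

I am trying to get the following output:

filename.txt- sum(d[filename][word]*log(prob))

Here, filename.txt has the form 5.txt, 10.txt, and so on.

"prob" is a value I have already obtained.

Basically, for each outer key (file), I need the sum over the inner keys (words) of their values (the word frequencies).

Say: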

d['5.txt']['the']=6

Here "the" is my word and "5.txt" is the file. 6 is the number of times "the" appears in "5.txt".

Similarly:

d['5.txt']['as']=2.

I need to find the sum of the dictionary values.

So, for 5.txt, the answer I need is:

6*log(prob('the'))+2*log(prob('as'))+... (for all the words in list)

I need to do this for all the files.
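
Written as a formula, the target quantity for each file f is as follows (score is only a label introduced here for illustration; d, prob, and list are the names used above):

```latex
\[
\mathrm{score}(f) \;=\; \sum_{w \,\in\, \mathrm{list}} d[f][w] \cdot \log\bigl(\mathrm{prob}(w)\bigr)
\]
% For f = 5.txt this expands to 6 log(prob(the)) + 2 log(prob(as)) + ...
```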

My problem is in the part where I iterate over the nested dictionary:

import collections, sys, os, re, math

sys.stdout=open('4.txt','w')  #redirect all printed output to 4.txt
from collections import Counter, defaultdict
from glob import glob

folderpath='d:/individual-articles'
folderpaths='d:/individual-articles/'
counter=Counter()
filepaths = glob(os.path.join(folderpath,'*.txt'))


#test contains: d:/individual-articles/5.txt,d:/individual-articles/10.txt,d:/individual-articles/15.txt and so on...
with open('test.txt', 'r') as fi:
    list2= [line.strip() for line in fi]


#temp contains the list of words
with open('temp.txt', 'r') as fi:
    list= [line.strip() for line in fi]


#the dictionary that contains d2[file][word]
d2 =defaultdict(dict)
for fil in list2:
    with open(fil) as f:
       path, name = os.path.split(fil)
       words_c = Counter([word for line in f for word in line.split()])
       for word in list:
           d2[name][word] = words_c[word]



#this portion generates the dictionary "prob" (stored in answer_ca) from file 2.txt; it can be overlooked!
answer_ca = collections.defaultdict(list)  #assumed definition: each key maps to a list of values
with open('2.txt', 'r+') as istream:
    for line in istream.readlines():
        try:
            k,r = line.strip().split(':')
            answer_ca[k.strip()].append(r.strip())
        except ValueError:
            print('Ignoring: malformed line: "{}"'.format(line))




#my problem lies here
items = d2.items()
small_d2 = dict(next(items) for _ in range(10))
for fil in list2:
    total=0
    for k,v in small_d2[fil].items():
        total=total+(v*answer_ca[k])
    print("Total of {} is {}".format(fil,total))

2 Answers


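with open(f) as fil assigns the opened file object to fil, while f remains the file name. So when you later access the entries in your dictionary with

total=sum(math.log(prob)*d2[fil][word].values())

I believe you mean

total = sum(math.log(prob)*d2[f][word])

That doesn't quite seem to match the order you're expecting, though, so I would suggest something more like this: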

import math

word_list = [...]  # list of words
file_list = [...]  # list of files
dictionary = {...}  # your dictionary
summation = lambda file_name, prob: sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])
return_value = []
for file_name in file_list:
    prob = ...  # something
    return_value.append(summation(file_name, prob))

The summation line there defines an anonymous function in Python. These are called lambda functions. Essentially, what that line means is:

summation = lambda file_name,prob:

is almost the same as:

def summation(file_name, prob):

and then

sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])

is almost the same as:

result = []
for word in word_list:
    result.append(math.log(prob)*dictionary[word][file_name])
return sum(result)

So overall you have:

summation = lambda file_name,prob: sum([(math.log(prob)*dictionary[word][file_name]) for word in word_list])

instead of:

def summation(file_name, prob):
    result = []
    for word in word_list:
        result.append(math.log(prob)*dictionary[word][file_name])
    return sum(result)

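The list comprehension is usually somewhat faster than the equivalent for loop that appends to a list, though using a lambda rather than def makes no difference to speed. There are few cases in Python where a for loop that only builds a list is preferable to a list comprehension, but they do exist.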
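
To make the equivalence concrete, here is a minimal sketch showing that the lambda form and the def form return the same value. It is only an illustration under assumptions: the sample counts and probabilities are made up, the dictionary is indexed as d2[file][word] as in the question, and prob is assumed to be a dict with one probability per word.

```python
import math

# Hypothetical sample data, not taken from the original post.
d2 = {"5.txt": {"the": 6, "as": 2}, "10.txt": {"the": 1, "as": 0}}
prob = {"the": 0.5, "as": 0.25}  # assumed: one probability per word
word_list = ["the", "as"]

# Lambda form: a generator expression passed straight to sum().
score_lambda = lambda file_name: sum(
    d2[file_name][word] * math.log(prob[word]) for word in word_list
)


# Equivalent def form with an explicit loop.
def score_def(file_name):
    total = 0.0
    for word in word_list:
        total += d2[file_name][word] * math.log(prob[word])
    return total


# One score per file, keyed by file name.
scores = {name: score_def(name) for name in d2}
```

Both forms compute the count-weighted sum of log probabilities for one file; which one to use is mostly a readability choice.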

answered 2013-07-03T19:45:52.853

from math import log

for fil in list2:  # list2 contains the filenames
    total = 0
    for k, v in d[fil].items():
        total += v*log(prob[k])  # where prob is a dict

    print("Total of {} is {}".format(fil, total))
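
One thing to watch for: in the question, list2 holds full paths while d2 is keyed by the bare file name, and the values read from 2.txt are strings. The sketch below shows one way to handle both. It assumes that 2.txt has one word:probability pair per line (inferred from the split(':') call in the question), and the helper names parse_prob_lines and score_files are hypothetical.

```python
import math
import os


def parse_prob_lines(lines):
    """Parse 'word:probability' lines into a dict of floats, skipping malformed lines."""
    prob = {}
    for line in lines:
        try:
            word, value = line.strip().split(":")
            prob[word.strip()] = float(value)
        except ValueError:
            # Wrong number of fields, or a value that is not a number.
            continue
    return prob


def score_files(paths, counts, prob):
    """Return {file name: sum of count * log(prob[word])} for each path."""
    scores = {}
    for path in paths:
        name = os.path.basename(path)  # counts is keyed by file name, not full path
        total = 0.0
        for word, count in counts[name].items():
            if count and word in prob:  # skip zero counts and words without a probability
                total += count * math.log(prob[word])
        scores[name] = total
    return scores
```

For example, score_files(list2, d2, prob) would give one total per file name. A probability of 0 would still raise a ValueError in math.log, so such entries need to be filtered or smoothed beforehand.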
answered 2013-07-03T19:45:58.227