python - A simple way to compute over a large data file in Python

Question

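I have to compute data from a large file. The file has about 100000 lines and 3 columns. The program below works on a small test file, but when I run it on the large file it takes a very long time to show even a single result. Any suggestions for speeding up loading and computing on the large data file?

Code: it computes correctly on the small test file; the input format is shown below.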

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)
pairper = defaultdict(float)

#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
  with open('output.txt', 'w') as o: 
    numline = 0
    for line in f:
        numline += 1
        line = line.split()
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])
        pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, pairper[pair], pairtime[pair]))

Input file:

5372 2684 460.0
1885 1158 351.0
1349 1174 6375.0
1980 1174 650.0
1980 1349 650.0
4821 2684 469.0
4821 937  459.0
2684 937  318.0
1980 606  390.0
1349 606  750.0
1174 606  750.0

score 1 · Accepted Answer

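The pairper calculation is what's killing you, and it isn't needed. You can use enumerate to count the input lines and simply use that value at the end. This is similar to martineau's answer, except that it doesn't pull the entire input list into memory (a bad idea), and it never builds pairper at all.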

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('input.txt', 'r') as f:
  with open('output.txt', 'w') as o: 
    for numline, line in enumerate(f, 1):
        line = line.split()
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]))
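The code above is Python 2 (`iteritems()`). For readers on Python 3, the same single-pass idea might look roughly like the sketch below. `summarize_pairs` is a hypothetical function name, and the in-memory sample stands in for `input.txt`: its rows are illustrative, not the asker's full data.

```python
from collections import defaultdict
import io


def summarize_pairs(lines):
    """Return (total_lines, {pair: (count, percent, total_time)}) in one pass."""
    paircount = defaultdict(int)
    pairtime = defaultdict(float)
    numline = 0  # stays 0 if the input is empty
    for numline, line in enumerate(lines, 1):
        fields = line.split()  # split() with no argument tolerates repeated spaces
        pair = fields[0], fields[1]
        paircount[pair] += 1
        pairtime[pair] += float(fields[2])
    # Percentages are computed once, after the final line count is known.
    summary = {
        pair: (c, c * 100.0 / numline, pairtime[pair])
        for pair, c in paircount.items()
    }
    return numline, summary


# Illustrative rows only; one pair is repeated to show the accumulation.
sample = io.StringIO(
    "1980 606 390.0\n1349 606 750.0\n1980 606 10.0\n1174 606 750.0\n"
)
total, summary = summarize_pairs(sample)
for (a, b), (count, percent, time_sum) in summary.items():
    print("%s, %s, %s, %s, %s" % (a, b, count, percent, time_sum))
```

Passing an open file object instead of the `StringIO` keeps the line-by-line streaming behavior, so the whole file is never held in memory.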

score 1 · Accepted Answer

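The main reason for the slowness is that you recreate the pairper dictionary for every line from the paircount dictionary, which keeps growing. That isn't necessary, because only the values computed after all the lines have been processed are ever used.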
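To make that cost concrete, the sketch below counts how many dictionary entries get built when the dictionary is rebuilt after every line versus once at the end. `rebuild_cost` is a hypothetical helper, not part of the original code, and the pairs are the eleven rows of the question's sample input.

```python
def rebuild_cost(pairs):
    """Count dict entries built when rebuilding per line vs. once at the end."""
    seen = set()
    per_line = 0
    for pair in pairs:
        seen.add(pair)
        per_line += len(seen)  # a full rebuild touches every distinct pair so far
    once = len(seen)           # a single rebuild after the loop
    return per_line, once


# The eleven pairs from the sample input file (all distinct).
sample_pairs = [
    ("5372", "2684"), ("1885", "1158"), ("1349", "1174"), ("1980", "1174"),
    ("1980", "1349"), ("4821", "2684"), ("4821", "937"), ("2684", "937"),
    ("1980", "606"), ("1349", "606"), ("1174", "606"),
]
per_line, once = rebuild_cost(sample_pairs)
print(per_line, once)  # 66 11
```

If every pair is distinct, n lines cost n(n+1)/2 entries with the per-line rebuild, so about 100000 lines would mean on the order of 5 billion entries, against 100000 for a single rebuild. Fewer distinct pairs lower that figure, but the per-line version still grows much faster than the single pass.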

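I don't fully understand what all the calculations are for, but here is something equivalent that should run much faster, because it creates the pairper dictionary only once. I also streamlined the logic a bit; that probably won't affect the run time much, but I think it makes the code easier to follow.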

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)

#get number of pair occurrences and total time
with open('easy_input.txt', 'r') as f, open('easy_output.txt', 'w') as o:
    for numline, line in enumerate((line.split() for line in f), start=1):
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])

    pairper = dict((pair, c * 100.0 / numline) for (pair, c)
                                                in paircount.iteritems())
    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c,
                                          pairper[pair], pairtime[pair]))
print 'done'
