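Recently I needed to create a histogram showing the frequency distribution of a large dataset. If the dataset were small, this would be a simple job. However, the dataset I need to plot contains about 800000000 numbers (assume each number takes 4 bytes), all stored in a text file, one number per line. The text file is about 4 GB. I tried GNUPLOT, but it complained that there was not enough memory to process this dataset. Can anyone suggest how to solve this problem, or any other tool that can do the job?

Thanks, Tom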

1 Answer

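I'd use Python. It's as easy as building a dictionary that maps each value to its count. The file is read one line at a time, so it never has to fit in memory. Assuming your file contains integers: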

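from collections import defaultdict

# Map each distinct value to the number of times it occurs.
d = defaultdict(int)
with open('datafile') as fin:
    for line in fin:  # streams the file one line at a time
        d[int(line)] += 1

# Print "value count" pairs in ascending order of value.
for item, number_of_occurrences in sorted(d.items()):
    print(item, number_of_occurrences)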

If you're on a newer version of Python (2.7 or later), this can be even easier with a Counter:

from collections import Counter

# Counter consumes the generator lazily, so lines are still read one at a time.
with open('datafile') as fin:
    d = Counter(int(line) for line in fin)

for item, number_of_occurrences in sorted(d.items()):
    print(item, number_of_occurrences)
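
Both versions keep one dictionary entry per distinct value, so memory grows with the number of distinct values rather than with the file size. If the values are floats, or spread over a very wide range, counting into a fixed number of equal-width bins might be a better fit. Here is a minimal sketch of that idea, assuming you already know (or can guess) the range of the data. The function name `binned_histogram` and the parameters `lo`, `hi` and `num_bins` are made up for illustration and are not part of the original approach:

```python
def binned_histogram(lines, lo, hi, num_bins):
    """Count values from an iterable of text lines into equal-width bins.

    lo, hi and num_bins are assumptions the caller must supply; values
    outside [lo, hi] are skipped rather than clamped.
    """
    counts = [0] * num_bins
    width = (hi - lo) / num_bins
    for line in lines:
        line = line.strip()
        if not line:
            continue  # ignore blank lines
        value = float(line)
        if value < lo or value > hi:
            continue  # outside the assumed range
        # A value equal to hi would index one past the end, so cap it
        # to make it land in the last bin.
        index = min(int((value - lo) / width), num_bins - 1)
        counts[index] += 1
    return counts

# Hypothetical usage, streaming the file without loading it into memory:
# with open('datafile') as fin:
#     counts = binned_histogram(fin, 0, 1000, 100)
```

Memory use here is fixed by `num_bins` no matter how many lines the file has. The resulting counts are small enough to write to a file and plot with whatever tool you like.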
answered 2013-01-19T23:54:19.277