python - Reading a large file in Python

Question

I have a "not so" large file (~2.2GB) that I am trying to read and process:

import os
from collections import defaultdict

graph = defaultdict(dict)
error = open("error.txt","w")
print "Reading file"
with open("final_edge_list.txt","r") as f:
    for line in f:
        try:
            line = line.rstrip(os.linesep)
            tokens = line.split("\t")
            if len(tokens)==3:
                src = long(tokens[0])
                destination = long(tokens[1])
                weight = float(tokens[2])
                #tup1 = (destination,weight)
                #tup2 = (src,weight)
                graph[src][destination] = weight
                graph[destination][src] = weight
            else:
                print "error ", line 
                error.write(line+"\n")
        except Exception, e:
            string = str(Exception) + " " + str(e) +"==> "+ line +"\n"
            error.write(string)
            continue

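Am I doing something wrong?

It has been about an hour since the code started reading the file (and it is still reading).

Memory usage is already at 20GB. Why does it take so much time and memory?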

score 3 · Accepted Answer

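To get a rough idea of where the memory is going, you can use the gc.get_objects function. Wrap the code above in a make_graph() function (this is best practice anyway), then wrap the call to that function in a KeyboardInterrupt exception handler that writes the gc data to a file.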

def main():
    try:
        make_graph()
    except KeyboardInterrupt:
        write_gc()

def write_gc():
    from os.path import exists
    fname = 'gc.log.%i'
    i = 0
    while exists(fname % i):
        i += 1
    fname = fname % i
    with open(fname, 'w') as f:
        from pprint import pformat
        from gc import get_objects
        f.write(pformat(get_objects()))


if __name__ == '__main__':
    main()

Now, whenever you ctrl+c your program, you will get a new gc.log file. Given a few samples, you should be able to see the memory issue.
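Dumping every object with pformat can produce a very large log. As a lighter alternative, here is a minimal sketch, assuming Python 3, that counts live objects by type name instead. The helper name `summarize_gc` is hypothetical and not part of the answer above. Note that gc.get_objects() only reports objects tracked by the collector (containers such as dicts, lists, and tuples), so plain int and float objects will not appear in the counts.

```python
import gc
from collections import Counter


def summarize_gc(top=10):
    """Return the `top` most common type names among gc-tracked objects."""
    # Hypothetical helper: tallies objects by type instead of dumping them all.
    counts = Counter(type(obj).__name__ for obj in gc.get_objects())
    return counts.most_common(top)


if __name__ == "__main__":
    # Print one "type: count" line per entry, most frequent first.
    for name, count in summarize_gc():
        print(f"{name}: {count}")
```

Calling summarize_gc() from the KeyboardInterrupt handler in place of write_gc() would give a quick view of which container types dominate.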

score 2 · Accepted Answer

Compared to other programming languages, Python's numeric types use quite a lot of memory. On my setup, each number appears to take 24 bytes:

>>> import sys
>>> sys.getsizeof(int())
24
>>> sys.getsizeof(float())
24

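Given that you have hundreds of millions of lines in that 2.2 GB input file, the reported memory consumption should not come as a surprise.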
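The per-number size is only part of the cost, since every edge also adds dict entries. One way to check this on your own machine is to measure it directly. The sketch below assumes Python 3 (tracemalloc is not available in Python 2) and builds a synthetic graph with the same defaultdict(dict) layout as the question. The function name `bytes_per_edge` and the synthetic node ids are assumptions for illustration; in this synthetic data every node has exactly one neighbour, which makes the per-edge overhead higher than in a graph with well-connected nodes.

```python
import tracemalloc
from collections import defaultdict


def bytes_per_edge(n_edges=100_000):
    """Build a synthetic symmetric graph and return traced bytes per edge."""
    tracemalloc.start()
    before, _ = tracemalloc.get_traced_memory()
    graph = defaultdict(dict)
    for i in range(n_edges):
        src, dst = i, i + n_edges  # distinct synthetic node ids (assumption)
        weight = i * 0.5
        # Same double insertion as in the question's code.
        graph[src][dst] = weight
        graph[dst][src] = weight
    after, _ = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return (after - before) / n_edges


if __name__ == "__main__":
    print("approx. bytes per edge:", bytes_per_edge())
```

Multiplying the printed value by the number of lines in the input file gives a rough estimate of the total footprint for this data layout.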

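One more point: some versions of the Python interpreter (including CPython 2.6) are known to keep so-called free lists for allocation performance, especially for objects of type int and float. Once allocated, that memory is not returned to the operating system until your process terminates. Also have a look at this question, which I posted when I first ran into the issue:

Python: Garbage collection fails?

Suggestions to work around this include: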

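use a subprocess to do the memory-hungry computation, e.g. based on the multiprocessing module
use a library that implements the functionality in C, e.g. numpy, pandas
use another interpreter, e.g. PyPy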
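To illustrate the second suggestion without a third-party dependency, the sketch below uses the standard-library array module as a stand-in for numpy. It is not numpy itself, but it shows the same idea of storing numbers as packed C values rather than Python objects. It assumes Python 3, the tab-separated format from the question, and node ids that fit in a signed 64-bit integer. The name `load_edges` is hypothetical.

```python
from array import array


def load_edges(lines):
    """Parse 'src<TAB>dst<TAB>weight' lines into three compact arrays."""
    sources = array("q")       # signed 64-bit integers
    destinations = array("q")  # signed 64-bit integers
    weights = array("d")       # 64-bit floats
    for line in lines:
        tokens = line.rstrip("\n").split("\t")
        if len(tokens) != 3:
            continue  # skip malformed lines; the question logs them instead
        # Convert all three fields first so the arrays stay aligned
        # if one of the conversions raises.
        src, dst, weight = int(tokens[0]), int(tokens[1]), float(tokens[2])
        sources.append(src)
        destinations.append(dst)
        weights.append(weight)
    return sources, destinations, weights
```

With this layout each edge costs 24 bytes of payload (three 8-byte values), at the price of losing the dict's constant-time lookup by node id.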

score 2 · Accepted Answer

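There are a few things you can do:

Run your code on a subset of the data. Measure the time it takes. Extrapolate to the full size of your data. That will give you an estimate of how long the full run will take.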

counter = 0
with open("final_edge_list.txt","r") as f:
    for line in f:
        counter += 1
        if counter == 200000:
            break
        try:
            ...

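On 1M lines it runs in ~8 seconds on my machine, so for a 2.2GB file with about 100M lines it should take ~15 minutes. However, once you exceed your available memory, that estimate no longer holds.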
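
The measure-and-extrapolate step above can be wrapped in a small helper. This is a minimal sketch, assuming Python 3 and a processing cost that grows linearly with the number of lines (which stops being true once the machine starts swapping). The names `estimate_runtime` and `process_line` are hypothetical; `process_line` stands for whatever per-line work your loop does.

```python
import time
from itertools import islice


def estimate_runtime(path, process_line, sample_lines, total_lines):
    """Time `process_line` on the first `sample_lines` lines and extrapolate."""
    start = time.perf_counter()
    seen = 0
    with open(path, "r") as f:
        # islice stops after sample_lines lines without reading the rest.
        for line in islice(f, sample_lines):
            process_line(line)
            seen += 1
    elapsed = time.perf_counter() - start
    if seen == 0:
        return 0.0  # empty file: nothing to extrapolate from
    # Linear extrapolation to the full file, in seconds.
    return elapsed * total_lines / seen
```

For example, estimate_runtime("final_edge_list.txt", handle, 1000000, 100000000) would return an estimate in seconds, where `handle` is your own per-line function.
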
Your graph seems to be symmetric:
```
graph[src][destination] = weight
graph[destination][src] = weight
```
Exploit the symmetry of graph in your graph-processing code; storing each edge only once cuts memory usage roughly in half.
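
One possible way to exploit the symmetry is to store each undirected edge once under an order-independent key. The sketch below assumes Python 3 and a plain dict; the helper names `edge_key`, `add_edge`, and `get_weight` are hypothetical.

```python
def edge_key(a, b):
    """Return an order-independent key for the undirected edge a-b."""
    return (a, b) if a <= b else (b, a)


def add_edge(graph, src, destination, weight):
    """Store the undirected edge once, under its canonical key."""
    graph[edge_key(src, destination)] = weight


def get_weight(graph, a, b, default=None):
    """Look up the weight of a-b regardless of argument order."""
    return graph.get(edge_key(a, b), default)
```

The trade-off is that listing all neighbours of a node now needs a scan or a separate index, so this layout fits best when the processing mostly asks for the weight of a given pair.
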
Run a profiler on your code with a subset of the data and see what happens there. The simplest option is to run
```
python -m cProfile --sort cumulative youprogram.py
```
There is a good article on speed and memory profilers: http://www.huyng.com/posts/python-performance-analysis/
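
If you would rather profile a single function than the whole script, the same profiler can be driven from code. This is a minimal sketch, assuming Python 3; `profile_call` is a hypothetical helper name, while cProfile.Profile and pstats.Stats are the standard-library classes.

```python
import cProfile
import io
import pstats


def profile_call(func, *args, top=10):
    """Run func(*args) under cProfile and return a cumulative-time report."""
    profiler = cProfile.Profile()
    profiler.enable()
    try:
        func(*args)
    finally:
        profiler.disable()  # always stop profiling, even if func raises
    stream = io.StringIO()
    stats = pstats.Stats(profiler, stream=stream)
    # Same ordering as `--sort cumulative` on the command line.
    stats.sort_stats("cumulative").print_stats(top)
    return stream.getvalue()
```

For instance, print(profile_call(make_graph)) would show the top entries by cumulative time, assuming make_graph() is the wrapper function around your reading loop.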

score 2 · Accepted Answer

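You don't need graph to be a defaultdict(dict); use a plain dict instead. graph[src, destination] = weight and graph[destination, src] = weight will do. Or store only one of them.
To reduce memory usage, try storing the resulting dataset in a scipy.sparse matrix; it consumes less memory and may be compressed.
What do you plan to do with your list of nodes afterwards?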
