python - Loading a 15GB file in Python

Question

I have a 15GB text file containing 25000 lines. In Python I am building a multi-level dictionary of the form: dict1 = {'':int}, dict2 = {'':dict1}.

I have to use the whole of dict2 many times in my program (about 1000 times, in a for loop). Can anyone suggest a good way to do this?

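Every line of the file stores the same type of information (the counts of the different RGB values of 25000 images, one image per line). For example, a line of the file looks like: image1:255,255,255-70;234,221,231-40;112,13,19-28; image2:5,25,25-30;34,15,61-20;102,103,109-228; and so on.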
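To make the line format concrete, here is a minimal parsing sketch. It assumes plain ASCII separators (`:` after the image name, `;` between entries, `-` before the count); the function name `parse_line` and the choice of `(r, g, b)` tuples as the dict1 keys are illustrative assumptions, not something the question specifies.

```python
def parse_line(line):
    """Parse 'name:r,g,b-count;r,g,b-count;...' into (name, {(r, g, b): count})."""
    name, _, body = line.strip().partition(":")
    counts = {}
    for entry in body.split(";"):
        entry = entry.strip()
        if not entry:  # skip the empty piece left by a trailing ';'
            continue
        rgb_text, _, count_text = entry.rpartition("-")
        rgb = tuple(int(value) for value in rgb_text.split(","))
        counts[rgb] = int(count_text)
    return name.strip(), counts


# One line becomes one entry of dict2 (image name -> dict1 of colour counts).
name, counts = parse_line("image1:255,255,255-70;234,221,231-40;112,13,19-28;")
dict2 = {name: counts}
```

Keeping the RGB text itself as the key (for example `'255,255,255'`) would match the `{'': int}` shape from the question more literally and works just as well.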

score 2 · Accepted Answer

The best approach is to read the file in chunks.

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

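# process_data is a placeholder for your own per-chunk processing.
with open('really_big_file.dat') as f:  # the file is closed when the block exits
    for piece in read_in_chunks(f):
        process_data(piece)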
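One thing to watch with fixed-size chunks is that a chunk can end in the middle of an image's line. A rough sketch of one way to handle that, carrying the incomplete tail over to the next chunk, is shown here; `read_lines_in_chunks` is a hypothetical helper name, and the sample data is made up for the demonstration.

```python
import io


def read_lines_in_chunks(file_object, chunk_size=1024 * 1024):
    """Yield complete lines while reading chunk_size characters at a time."""
    leftover = ""
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        pieces = (leftover + chunk).split("\n")
        leftover = pieces.pop()  # the last piece may be an incomplete line
        for line in pieces:
            yield line
    if leftover:  # final line that had no trailing newline
        yield leftover


# A tiny in-memory stand-in for the big file, read 7 characters at a time.
demo = io.StringIO("image1:1,2,3-4;\nimage2:5,6,7-8;\nimage3:9,9,9-1;")
lines = list(read_lines_in_chunks(demo, chunk_size=7))
```

For a line-oriented text file this reproduces what iterating over the file object already does, so it is mainly useful when the chunk size has to be controlled explicitly.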

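Note that once you start working with large files, moving to a map-reduce idiom may help, because you can then process the separate chunks of the file independently without pulling the complete dataset into memory.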
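As a small illustration of the map-reduce idea applied to this kind of data, the sketch below maps each image line to a partial result and then merges the partial results. The task (total pixel count per colour across all images), the function names `map_line` and `reduce_counts`, and the ASCII `:`, `;`, `-` separators are all assumptions made for the example.

```python
from collections import Counter
from functools import reduce


def map_line(line):
    """Map step: turn one image line into a Counter of {rgb_text: pixel_count}."""
    _, _, body = line.strip().partition(":")
    partial = Counter()
    for entry in body.split(";"):
        entry = entry.strip()
        if entry:
            rgb_text, _, count_text = entry.rpartition("-")
            partial[rgb_text] += int(count_text)
    return partial


def reduce_counts(total, partial):
    """Reduce step: merge one partial result into the running total."""
    total.update(partial)  # Counter.update adds counts rather than replacing them
    return total


# Any iterable of lines works here, including an open file object.
lines = [
    "image1:255,255,255-70;234,221,231-40;",
    "image2:255,255,255-30;5,25,25-30;",
]
totals = reduce(reduce_counts, map(map_line, lines), Counter())
```

Because `map_line` looks at only one line at a time, only the running total has to stay in memory, and the built-in `map` could in principle be swapped for something like `multiprocessing.Pool.imap` to spread the map step across processes.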

score 1 · Accepted Answer

In Python, if you use a file object as an iterator, you can read a file line by line without loading the whole file into memory.

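# do_something_with is a placeholder for your own per-line processing.
with open("huge_file.txt") as f:  # the file is closed when the block exits
    for line in f:
        do_something_with(line)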
