python - Download a bz2 archive and read the compressed files in memory (avoiding memory overflow)

Question

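As the title says, I'm downloading a bz2 archive that contains a folder with a lot of text files in it...

My first version decompressed everything in memory. The archive is only 90 MB, but it holds 60 files of about 750 MB each once decompressed... the computer froze! Obviously it can't handle something like 40 GB of RAM XD

So the problem is that the files are too big to keep in memory all at once. I'm now using the code below, which works but is bad (too slow):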

import os
import tarfile

import requests

# zsets.BASE_DIR, check_time and some_processing come from the rest of the project (not shown)
response = requests.get('https://fooweb.com/barfile.bz2')

# Save the archive to disk:
compress_filepath = '{0}/files/sources/{1}'.format(zsets.BASE_DIR, check_time)
with open(compress_filepath, 'wb') as local_file:
    local_file.write(response.content)

# Extract the files into a folder:
extract_folder = compress_filepath + '_ext'
with tarfile.open(compress_filepath, "r:bz2") as tar:
    tar.extractall(extract_folder)

# Process one file at a time:
for filename in os.listdir(extract_folder):
    filepath = '{0}/{1}'.format(extract_folder, filename)
    with open(filepath, 'r') as f:
        file = f.readlines()

    for line in file:
        some_processing(line)

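Is there a way to do this without dumping everything to disk, decompressing and reading only one file at a time from the .bz2?

Thank you very much in advance for your time, I hope someone knows how to help me...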

score 0 · Accepted Answer

#!/usr/bin/python3
import sys
import requests
import tarfile
got = requests.get(sys.argv[1], stream=True)
with tarfile.open(fileobj=got.raw, mode='r|*') as tar:
    for info in tar:
        if info.isreg():
            ent = tar.extractfile(info)
            # now process ent as a file, however you like
            print(info.name, len(ent.read()))
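
Note that `ent.read()` above still loads one whole member (about 750 MB in the question) into memory. A minimal sketch of a line-by-line variant follows, assuming the members are UTF-8 text. The names `iter_tar_lines` and `build_sample_archive` are hypothetical helpers made up for this illustration, and the sample archive is built in memory only so the sketch runs without a network connection. With requests you would pass `got.raw` as `fileobj` instead.

```python
import io
import tarfile


def iter_tar_lines(fileobj, encoding="utf-8"):
    """Yield (member_name, line) pairs from a compressed tar stream.

    fileobj only needs a read() method, e.g. `got.raw` from the answer above.
    Mode 'r|*' reads the archive strictly front to back and detects the
    compression automatically, so nothing is written to disk.
    """
    with tarfile.open(fileobj=fileobj, mode="r|*") as tar:
        for info in tar:
            if not info.isreg():
                continue  # skip directories, links and other non-file entries
            member = tar.extractfile(info)
            # Iterating the member yields one bytes line at a time,
            # so a large member is never held in memory as a whole.
            for raw_line in member:
                yield info.name, raw_line.decode(encoding).rstrip("\n")


def build_sample_archive():
    """Build a tiny tar.bz2 in memory (made-up data, for illustration only)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:bz2") as tar:
        for name, text in [("folder/a.txt", "one\ntwo\n"), ("folder/b.txt", "three\n")]:
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    buf.seek(0)
    return buf


if __name__ == "__main__":
    for name, line in iter_tar_lines(build_sample_archive()):
        print(name, line)
```

Because the stream cannot be rewound in `'r|*'` mode, each member has to be consumed before moving on to the next one. The generator does that naturally, but it also means you cannot jump back to an earlier file.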
