
I need to analyze a large dataset that is distributed as an lz4-compressed JSON file.

The compressed file is almost 1 TB. I'd prefer not to decompress it to disk due to the cost. Each "record" in the dataset is very small, but it is obviously not feasible to read the entire dataset into memory.

Any advice on how to iterate through records in this large lz4-compressed JSON file in Python 2.7?


1 Answer


As of version 0.19.1 of the python lz4 bindings, buffered IO is fully supported. So you should be able to do something like this:

import lz4.frame

chunk_size = 128 * 1024 * 1024
with lz4.frame.open('mybigfile.lz4', 'r') as f:
    while True:
        chunk = f.read(chunk_size)
        if not chunk:
            break
        # Do stuff with this chunk of data.

This will read roughly 128 MB of data from the file at a time.
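If each record sits on its own line (JSON Lines, a common way datasets like this are distributed), you can carry the partial line at the end of each chunk over to the next chunk and parse records one at a time. Here is a minimal sketch built on the same read() loop, assuming newline-delimited records and Python 2.7 (where read() returns a str); iter_records is just an illustrative helper name, not part of the lz4 API:

import json
import lz4.frame

chunk_size = 128 * 1024 * 1024

def iter_records(path):
    # Assumes one JSON record per line (JSON Lines). A chunk boundary
    # can fall in the middle of a line, so keep the trailing partial
    # line and prepend it to the next chunk before splitting.
    leftover = ''
    with lz4.frame.open(path, 'r') as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            lines = (leftover + chunk).split('\n')
            leftover = lines.pop()  # last element may be incomplete
            for line in lines:
                if line:
                    yield json.loads(line)
    if leftover:
        yield json.loads(leftover)

for record in iter_records('mybigfile.lz4'):
    pass  # process one small record at a time

This keeps at most one chunk plus one partial line in memory, regardless of the total file size.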

Aside: I am the maintainer of the python lz4 package. If you have problems with the package, or if anything in the documentation is unclear, please file an issue on the project page.

answered 2018-01-21T15:14:01.060