python - How do I extract data from a 4GB JSON file?

Question

I have a 4GB JSON file with the following structure:

{
    rows: [
        { id: 1, names: { first: 'john', last: 'smith' }, dates: ...},
        { id: 2, names: { first: 'tim', middle: ['james', 'andrew'], last: 'wilson' }, dates: ... },
    ]
}

I just want to loop over all the rows, extract the ID, the names and a few other details from each one, and write them to a CSV file.

If I try to open the file the standard way, it hangs. I have been trying to use ijson, as follows:

import ijson

f = open('./myfile.json')
rows = ijson.items(f, 'rows')
for r in rows:
    print(r)

This works on a short extract of the file, but on the full file it hangs forever.

I have also tried this ijson approach, which does seem to work on the full 4GB file:

for prefix, the_type, value in ijson.parse(open(fname)):
    print(prefix, value)

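But this seems to print each leaf node in turn, with no concept of each top-level row as a separate item, and that quickly becomes unwieldy for JSON data with an arbitrary number of leaf nodes. To get an array of all the names, I would need to do something like this: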

names = []
name = {}
for prefix, the_type, value in ijson.parse(open(fname)):
    print(prefix, value)
    name[prefix] = value  # store the parsed value, not the literal string 'value'
    if 'first' in name and 'last' in name and 'middle' in name:
        # This is the last of the leaf nodes, we can add it to our list...
        # except.... how to deal with the fact that middle may not 
        # always be present?
        names.append(name)
        name = {}
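
One possible way around the optional `middle` key, offered only as a sketch: instead of checking which keys have been seen, treat the `end_map` event on the `rows.item` prefix as the end of a row. The helper name `group_names` is hypothetical, and the code assumes the events have the `(prefix, event, value)` shape that `ijson.parse` yields and that the names sit under `rows.item.names` as in the structure shown earlier.

```python
def group_names(events):
    """Yield one dict of name parts per row from ijson.parse-style events."""
    names_prefix = 'rows.item.names.'
    name = {}
    for prefix, event, value in events:
        if prefix == 'rows.item' and event == 'end_map':
            # The row's object just closed, so the row is complete
            # whether or not it contained a 'middle' entry.
            yield name
            name = {}
        elif prefix.startswith(names_prefix) and event == 'string':
            key = prefix[len(names_prefix):]
            if key.endswith('.item'):
                # Array elements (e.g. 'middle.item') are collected in a list.
                name.setdefault(key[:-len('.item')], []).append(value)
            else:
                name[key] = value
```

It could then be driven with something like `for name in group_names(ijson.parse(open(fname, 'rb'))): ...`, which keeps only one row's names in memory at a time.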

Is there a way to iterate over each row in turn (rather than each leaf) in a file this large?
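
A minimal sketch of one approach that may fit, assuming Python 3 and that the real file is valid JSON with the shape shown above. In ijson, the prefix `'rows'` selects the whole array as a single item, which is probably why the first attempt appeared to hang, while `'rows.item'` selects each element of that array in turn. The helper names `flatten_row`, `write_rows` and `stream_rows`, the CSV column layout, and joining middle names with a space are all illustrative choices rather than anything from the question.

```python
import csv


def flatten_row(row):
    """Turn one row dict into a flat dict of CSV columns."""
    names = row.get('names', {})
    return {
        'id': row.get('id'),
        'first': names.get('first', ''),
        # 'middle' is optional and holds a list when present.
        'middle': ' '.join(names.get('middle', [])),
        'last': names.get('last', ''),
    }


def write_rows(rows, out):
    """Write an iterable of row dicts to the text file object `out` as CSV."""
    writer = csv.DictWriter(out, fieldnames=['id', 'first', 'middle', 'last'])
    writer.writeheader()
    for row in rows:
        writer.writerow(flatten_row(row))


def stream_rows(path):
    """Yield rows one at a time without loading the whole file."""
    # Third-party dependency; imported here so the helpers above work without it.
    import ijson
    with open(path, 'rb') as f:
        # 'rows.item' matches each element of the top-level "rows" array.
        for row in ijson.items(f, 'rows.item'):
            yield row
```

Usage would be along the lines of `write_rows(stream_rows('./myfile.json'), out)`, with `out` opened via `open('out.csv', 'w', newline='')`. Because both `stream_rows` and `write_rows` work lazily, only one row should need to be held in memory at a time.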
