json - How can I use ijson (or something else) to parse this large JSON file?

Question

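I have this huge JSON file (8 GB), and I run out of memory when I try to read it into Python. How would I implement a similar process using ijson, or some other library that handles large JSON files more efficiently?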

import json
import pandas as pd

# There are (say) 1m objects in this file, each one its own JSON object.
with open('my_file.json') as json_file:
    data = json_file.readlines()
    # So I take a list of these JSON objects
    list_of_objs = [json.loads(obj) for obj in data]

# But I only want about 200 of the JSON objects
desired_data = [obj for obj in list_of_objs if obj['feature'] == "desired_feature"]

How would I implement this using ijson or something similar? Is there a way to extract the objects I want without reading in the whole JSON file?

The file is a list of objects, for example:

{
    "review_id": "zdSx_SD6obEhz9VrW9uAWA",
    "user_id": "Ha3iJu77CxlrFm-vQRs_8g",
    "business_id": "tnhfDv5Il8EaGSXZGiuQGg",
    "stars": 4,
    "date": "2016-03-09",
    "text": "Great place to hang out after work: the prices are decent, and the ambience is fun. It's a bit loud, but very lively. The staff is friendly, and the food is good. They have a good selection of drinks.",
    "useful": 0,
    "funny": 0
}

score 1 · Accepted Answer

"The file is a list of objects"

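That is a bit ambiguous. Looking at your code snippet, your file appears to contain a separate JSON object on each line. That is not the same as an actual JSON array, which starts with [, ends with ], and has , between items.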
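If it is not obvious which of the two layouts a file uses, peeking at its first non-whitespace character is usually enough. This is only a rough sketch: `looks_like_json_array` is a made-up helper name, and it assumes the file is either a top-level array or a sequence of objects.

```python
import io

def looks_like_json_array(f):
    """Return True if the first non-whitespace character read from f is '['."""
    while True:
        c = f.read(1)
        if not c:          # reached end of file without finding any content
            return False
        if not c.isspace():
            return c == '['

# Demo on in-memory samples; with a real file, pass open('my_file.json') instead.
print(looks_like_json_array(io.StringIO('[{"stars": 4}, {"stars": 5}]')))   # JSON array
print(looks_like_json_array(io.StringIO('{"stars": 4}\n{"stars": 5}\n')))  # one object per line
```

Only a handful of characters are read, so the check costs nothing even on an 8 GB file.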

In the case of a JSON-per-line file, it is simple:

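import json
from itertools import islice

filename = 'my_file.json'  # path from the question

with open(filename) as f:
    # Generator expression: each line is parsed only when it is requested
    objects = (json.loads(line) for line in f)
    # Consume the lazy iterator while the file is still open
    objects = list(islice(objects, 200))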

Note the differences:

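You don't need .readlines(); the file object is itself an iterable that yields individual lines.
The parentheses (..) instead of brackets [..] in (... for line in f) create a lazy generator expression rather than a Python list holding every line in memory.
islice(objects, 200) gives you the first 200 items without iterating any further. If objects were a list, you could just do objects[:200].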
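The same lazy pattern also covers the filtering the question asks for (keep only objects whose key has a given value) rather than just taking the first 200. A minimal sketch, assuming one JSON object per line; `iter_matching` is a hypothetical helper name, and the key and value are the placeholders from the question:

```python
import io
import json

def iter_matching(lines, key, wanted):
    """Lazily yield the parsed objects whose `key` equals `wanted`."""
    for line in lines:
        line = line.strip()
        if not line:                 # tolerate blank lines between objects
            continue
        obj = json.loads(line)
        if obj.get(key) == wanted:   # .get() so objects without the key are skipped
            yield obj

# Demo on an in-memory sample; with the real file, pass open('my_file.json') instead.
sample = io.StringIO(
    '{"feature": "desired_feature", "stars": 4}\n'
    '{"feature": "something_else", "stars": 2}\n'
)
desired_data = list(iter_matching(sample, 'feature', 'desired_feature'))
print(desired_data)
```

Only one line is held in memory at a time, so peak memory depends on the number of matches (about 200 here), not on the size of the file.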

Now, if your file is actually a JSON array, then you do need ijson:

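import ijson  # or choose a faster backend if needed
from itertools import islice

filename = 'my_file.json'  # path from the question

with open(filename, 'rb') as f:  # ijson expects a binary file object
    objects = ijson.items(f, 'item')
    # Consume the lazy iterator while the file is still open
    objects = list(islice(objects, 200))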

ijson.items returns a lazy iterator over the parsed array. The 'item' in the second argument means "each item in the top-level array".

score 0 · Accepted Answer

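The problem is that not all JSON is nicely formatted, and you can't rely on line-by-line parsing to extract your objects. I understood your "acceptance criteria" as "collect only those JSON objects whose specified key contains the specified value". For example, only collect the object about a person if that person's name is "Bob". The following function returns a list of all objects that meet your criteria. Parsing is done character by character (this would be far more efficient in C, but Python still does fairly well). It should be more robust, because it doesn't care about newlines, formatting, and so on. I tested it on both formatted and unformatted JSON with 1,000,000 objects.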

import json

def parse_out_objects(file, feature, desired_value):
    """Collect top-level objects whose `feature` key equals `desired_value`."""
    selected_objects = []
    chars = []          # characters of the object currently being composed
    depth = 0           # nesting depth of '{' ... '}' outside of strings
    in_string = False   # True while inside a JSON string literal
    escaped = False     # True if the previous character in a string was a backslash
    with open(file) as f:
        while True:
            c = f.read(1)
            if not c:
                break
            if in_string:
                # Inside a string, braces are data; only track escapes and the closing quote
                if escaped:
                    escaped = False
                elif c == '\\':
                    escaped = True
                elif c == '"':
                    in_string = False
            elif c == '"':
                in_string = True
            elif c == '{':
                depth += 1
            if depth > 0:
                chars.append(c)
            if c == '}' and not in_string and depth > 0:
                depth -= 1
                if depth == 0:
                    # A complete top-level object has been collected
                    json_object = json.loads(''.join(chars))
                    if json_object.get(feature) == desired_value:
                        selected_objects.append(json_object)
                    chars = []
    return selected_objects
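For comparison, the same "ignore the formatting" idea can be sketched with the standard library's json.JSONDecoder.raw_decode, which parses one value from the start of a string and reports where it ended. This is not the function above, just an alternative under stated assumptions: `iter_objects` and its chunk size are hypothetical, and every top-level item is assumed to be a JSON object, so any [, , and ] between items can simply be skipped.

```python
import io
import json

def iter_objects(f, chunk_size=65536):
    """Lazily yield top-level JSON objects from a text file object."""
    decoder = json.JSONDecoder()
    separators = ' \t\r\n[,]'   # whitespace and array punctuation between objects
    buf = ''
    while True:
        chunk = f.read(chunk_size)
        buf = (buf + chunk).lstrip(separators)
        while buf:
            try:
                obj, end = decoder.raw_decode(buf)
            except json.JSONDecodeError:
                break               # object is incomplete so far; read more data
            yield obj
            buf = buf[end:].lstrip(separators)
        if not chunk:               # end of file
            if buf:
                raise ValueError('incomplete or malformed JSON at end of input')
            return

# Demo on an in-memory sample; with a real file, pass open('my_file.json') instead.
sample = io.StringIO('[{"name": "Bob", "note": "}{"}, {"name": "Alice"}]')
bobs = [obj for obj in iter_objects(sample, chunk_size=8) if obj.get('name') == 'Bob']
print(bobs)
```

Because the real JSON parser decides where each object ends, braces and escaped quotes inside strings need no special handling, and only the objects that match are kept in memory.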
