python - How to flatten a nested JSON array?

Question

I need to flatten a JSON document in Python that contains nested JSON arrays at different levels.

Part of my JSON looks like this:

{
  "data": {
    "workbooks": [
      {
        "projectName": "TestProject",
        "name": "wkb1",
        "site": {
          "name": "site1"
        },
        "description": "",
        "createdAt": "2020-12-13T15:38:58Z",
        "updatedAt": "2020-12-13T15:38:59Z",
        "owner": {
          "name": "user1",
          "username": "John"
        },
        "embeddedDatasources": [
          {
            "name": "DS1",
            "hasExtracts": false,
            "upstreamDatasources": [
              {
                "projectName": "Data Sources",
                "name": "DS1",
                "hasExtracts": false,
                "owner": {
                  "username": "user2"
                }
              }
            ],
            "upstreamTables": [
              {
                "name": "table_1",
                "schema": "schema_1",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              },
              {
                "name": "table_2",
                "schema": "schema_2",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              },
              {
                "name": "table_3",
                "schema": "schema_3",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              }
            ]
          },
          {
            "name": "DS2",
            "hasExtracts": false,
            "upstreamDatasources": [
              {
                "projectName": "Data Sources",
                "name": "DS2",
                "hasExtracts": false,
                "owner": {
                  "username": "user3"
                }
              }
            ],
            "upstreamTables": [
              {
                "name": "table_4",
                "schema": "schema_1",
                "database": {
                  "name": "testdb",
                  "connectionType": "redshift"
                }
              }
            ]
          }
        ]
      }
    ]
  }
}

The output should look like this:

(image: sample output)

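I tried json_normalize but could not get it to work. For now I parse the data by looping over the nested arrays and reading the values by key. I am looking for a better way to normalize the JSON.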

score 0 · Accepted Answer

Here is a partial solution:

First, save the data as a JSON file named data.json in the same directory as the script.

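import json
import pandas as pd  # pd.json_normalize is the top-level entry point since pandas 1.0

# Load the JSON document saved next to the script.
with open('data.json') as json_file:
    json_data = json.load(json_file)

# The records of interest live under data -> workbooks.
new_data = json_data['data']['workbooks']

# One row per upstream table; workbook-level fields are repeated as metadata.
# record_prefix='_' keeps the table's 'name' from clashing with the workbook's 'name'.
result = pd.json_normalize(
    new_data,
    record_path=['embeddedDatasources', 'upstreamTables'],
    meta=['projectName', 'name', 'createdAt', 'updatedAt', 'owner', 'site'],
    record_prefix='_',
)

result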

Output:

	_name	_schema	_database.name	_database.connectionType	projectName	name	createdAt	updatedAt	owner	site
0	table_1	schema_1	testdb	redshift	TestProject	wkb1	2020-12-13T15:38:58Z	2020-12-13T15:38:59Z	{'name': 'user1', 'username': 'John'}	{'name': 'site1'}
1	table_2	schema_2	testdb	redshift	TestProject	wkb1	2020-12-13T15:38:58Z	2020-12-13T15:38:59Z	{'name': 'user1', 'username': 'John'}	{'name': 'site1'}
2	table_3	schema_3	testdb	redshift	TestProject	wkb1	2020-12-13T15:38:58Z	2020-12-13T15:38:59Z	{'name': 'user1', 'username': 'John'}	{'name': 'site1'}
3	table_4	schema_1	testdb	redshift	TestProject	wkb1	2020-12-13T15:38:58Z	2020-12-13T15:38:59Z	{'name': 'user1', 'username': 'John'}	{'name': 'site1'}

What next?

I think that if you restructure the data in advance (for example, by flattening 'database': {'name': 'testdb', 'connectionType': 'redshift'}), you will be able to add more fields to the meta parameter.
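One possible reading of that suggestion, sketched below: lift the workbook-level nested objects (owner, site) into plain top-level keys before calling json_normalize, so they can be listed in meta as scalar columns instead of dicts. The helper flatten_workbook and the key names owner_name, owner_username and site_name are made up for this illustration, and the inline sample is a trimmed copy of the question's JSON rather than the full file.

```python
import pandas as pd

# Trimmed-down sample with the same shape as the JSON in the question.
json_data = {
    "data": {
        "workbooks": [
            {
                "projectName": "TestProject",
                "name": "wkb1",
                "site": {"name": "site1"},
                "owner": {"name": "user1", "username": "John"},
                "embeddedDatasources": [
                    {
                        "name": "DS1",
                        "upstreamTables": [
                            {"name": "table_1", "schema": "schema_1",
                             "database": {"name": "testdb", "connectionType": "redshift"}},
                            {"name": "table_2", "schema": "schema_2",
                             "database": {"name": "testdb", "connectionType": "redshift"}},
                        ],
                    },
                    {
                        "name": "DS2",
                        "upstreamTables": [
                            {"name": "table_4", "schema": "schema_1",
                             "database": {"name": "testdb", "connectionType": "redshift"}},
                        ],
                    },
                ],
            }
        ]
    }
}


def flatten_workbook(workbook):
    """Hypothetical helper: copy a workbook and lift owner/site into scalar keys."""
    flat = dict(workbook)          # shallow copy, the input is left untouched
    owner = flat.pop("owner", {})
    site = flat.pop("site", {})
    flat["owner_name"] = owner.get("name")
    flat["owner_username"] = owner.get("username")
    flat["site_name"] = site.get("name")
    return flat


workbooks = [flatten_workbook(wb) for wb in json_data["data"]["workbooks"]]

# Same call shape as in the answer, but meta now lists only scalar fields.
result = pd.json_normalize(
    workbooks,
    record_path=["embeddedDatasources", "upstreamTables"],
    meta=["projectName", "name", "owner_name", "owner_username", "site_name"],
    record_prefix="_",
)
print(result)
```

With this shape, each upstream table still produces one row, and the owner and site details appear as ordinary columns rather than as dict values.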

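As you can see in the documentation for json_normalize, the four parameters used here are:

data: dict or list of dicts
- Unserialized JSON objects.
record_path: str or list of str, default None
- Path in each object to the list of records. If not passed, data will be assumed to be an array of records.
meta: list of paths (str or list of str), default None
- Fields to use as metadata for each record in the resulting table.
record_prefix: str, default None
- If True, prefix records with dotted (?) path, e.g. foo.bar.field if the path to records is ['foo', 'bar'].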
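To see how these four parameters interact, here is a minimal sketch on a made-up data set (the team/members structure is invented for illustration and is not part of the question). In practice the record_prefix string is simply prepended to the column names that come from the records, which matches the _name and _schema columns in the output above.

```python
import pandas as pd

# Invented sample: each team holds a list of member records.
data = [
    {
        "team": "alpha",
        "members": [
            {"name": "Ann", "role": "dev"},
            {"name": "Bob", "role": "ops"},
        ],
    },
    {"team": "beta", "members": [{"name": "Cy", "role": "dev"}]},
]

members = pd.json_normalize(
    data,                      # data: list of dicts
    record_path="members",     # where the list of records lives in each dict
    meta=["team"],             # parent field repeated on every record row
    record_prefix="member_",   # prepended to columns taken from the records
)
print(members.columns.tolist())  # ['member_name', 'member_role', 'team']
```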
