
I've spent the whole day struggling with this JSON, which I want to convert to a CSV.

It represents the officers attached to company number "OC418979" in the UK Companies House API.

I have truncated the JSON so that "items" contains only 2 objects.

What I would like to get is a CSV like this:

OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
OC418979, country_of_residence, officer_role, appointed_on
...

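There are two extra complications. First, there are two types of "officers": some are people and some are companies, so not every key present on one type appears on the other, and vice versa. I'd like those entries to be empty. Second, some values need care: "name" contains a comma, and "address" is a nested object with several sub-fields (which I guess I could flatten in pandas).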

{
  "total_results": 13,
  "resigned_count": 9,
  "links": {
    "self": "/company/OC418979/officers"
  },
  "items_per_page": 35,
  "etag": "bc7955679916b089445c9dfb4bc597aa0daaf17d",
  "kind": "officer-list",
  "active_count": 4,
  "inactive_count": 0,
  "start_index": 0,
  "items": [
    {
      "officer_role": "llp-designated-member",
      "name": "BARRICK, David James",
      "date_of_birth": {
        "year": 1984,
        "month": 1
      },
      "appointed_on": "2017-09-15",
      "country_of_residence": "England",
      "address": {
        "country": "United Kingdom",
        "address_line_1": "Old Gloucester Street",
        "locality": "London",
        "premises": "27",
        "postal_code": "WC1N 3AX"
      },
      "links": {
        "officer": {
          "appointments": "/officers/d_PT9xVxze6rpzYwkN_6b7og9-k/appointments"
        }
      }
    },
    {
      "links": {
        "officer": {
          "appointments": "/officers/M2Ndc7ZjpyrjzCXdFZyFsykJn-U/appointments"
        }
      },
      "address": {
        "locality": "Tadcaster",
        "country": "United Kingdom",
        "address_line_1": "Westgate",
        "postal_code": "LS24 9AB",
        "premises": "5a"
      },
      "identification": {
        "legal_authority": "UK",
        "identification_type": "non-eea",
        "legal_form": "UK"
      },
      "name": "PREMIER DRIVER LIMITED",
      "officer_role": "corporate-llp-designated-member",
      "appointed_on": "2017-09-15"
    }
  ]
}

What I've been doing is creating new JSON objects to extract the fields I need, like this:

{officer_address:.items[]?.address, appointed_on:.items[]?.appointed_on, country_of_residence:.items[]?.country_of_residence, officer_role:.items[]?.officer_role, officer_dob:items.date_of_birth, officer_nationality:.items[]?.nationality, officer_occupation:.items[]?.occupation}

But the query runs for hours, and I'm sure there is a quicker way.

Right now I'm trying a new approach: creating a JSON object keyed by the company number, with its list of officers as the value.

{(.links.self | split("/")[2]): .items[]}

2 Answers


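With jq, it's easier to first extract the values every row will share from the top-level object, and then generate the desired rows. You'll want to iterate over the items at most once. Each .items[] inside an object constructor is a separate generator, so the filter in the question emits every combination of the values instead of one object per officer.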

$ jq -r '(.links.self | split("/")[2]) as $companyCode 
   | .items[]
   | [ $companyCode, .country_of_residence, .officer_role, .appointed_on ]
   | @csv
' input.json
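
For comparison, the same single-pass idea can be sketched in Python. This is only an illustration under assumptions: the function name officers_to_csv and the inlined sample dictionary are made up for the example, and the company code is taken from links.self exactly as the jq filter does.

```python
import csv
import io


def officers_to_csv(data):
    """Return CSV text with one row per officer, visiting data["items"] once."""
    # Shared value from the top-level object, computed a single time:
    # "/company/OC418979/officers".split("/")[2] == "OC418979"
    company_code = data["links"]["self"].split("/")[2]

    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    for officer in data.get("items", []):
        # dict.get returns None for a missing key; csv.writer writes None as "".
        writer.writerow([
            company_code,
            officer.get("country_of_residence"),
            officer.get("officer_role"),
            officer.get("appointed_on"),
        ])
    return buffer.getvalue()


# Hypothetical, trimmed copy of the sample response from the question.
sample = {
    "links": {"self": "/company/OC418979/officers"},
    "items": [
        {"officer_role": "llp-designated-member",
         "appointed_on": "2017-09-15",
         "country_of_residence": "England"},
        {"officer_role": "corporate-llp-designated-member",
         "appointed_on": "2017-09-15"},
    ],
}
print(officers_to_csv(sample), end="")
```

The loop touches each officer once, so the work grows linearly with the number of items.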
Answered 2019-02-28T21:08:18.013

OK, you want to scan the list of officers, extract some fields from each one if they are present, and write them out in CSV format.

The first part is extracting the data from the JSON. Assuming you have loaded it into a Python object named data, you have:

print(data['items'][0]['officer_role'], data['items'][0]['appointed_on'],
      data['items'][0]['country_of_residence'])

which gives:

llp-designated-member 2017-09-15 England
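
As a rough sketch of how data might be obtained, and of why direct indexing is fragile here, the snippet below parses a trimmed copy of the sample with the standard json module. The inlined string and the variable names person, company and missing are assumptions for illustration only; with a real file you would use json.load on an open file object instead.

```python
import json

# Trimmed copy of the sample response, inlined so the sketch runs on its own.
raw = """
{
  "items": [
    {"officer_role": "llp-designated-member",
     "appointed_on": "2017-09-15",
     "country_of_residence": "England"},
    {"officer_role": "corporate-llp-designated-member",
     "appointed_on": "2017-09-15"}
  ]
}
"""
data = json.loads(raw)

person, company = data["items"]
print(person["country_of_residence"])

# The corporate officer has no "country_of_residence" key,
# so plain indexing raises KeyError for that entry.
try:
    company["country_of_residence"]
    missing = False
except KeyError:
    missing = True
print(missing)
```

This is why a lookup that tolerates absent keys is needed once both kinds of officer are processed.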

Time to put it all together with the csv module:

import csv
...
with open('output.csv', 'w', newline='') as fd:
    wr = csv.writer(fd)
    for officer in data['items']:
        _ = wr.writerow(('OC418979',
                 officer.get('country_of_residence',''),
                 officer.get('officer_role', ''),
                 officer.get('appointed_on', '')
                 ))

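The get method on dictionaries lets you supply a default value (here the empty string) when a key is absent, and the csv module ensures that a field containing a comma is enclosed in quotes.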

With your sample input, it gives:

OC418979,England,llp-designated-member,2017-09-15
OC418979,,corporate-llp-designated-member,2017-09-15
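
The question also mentions the name field, which contains a comma, and the nested address object. A minimal sketch of handling both with only the standard library follows. The helper flatten, the underscore-joined column names and the choice of columns are assumptions made for this example, not part of the answer above.

```python
import csv
import io


def flatten(obj, prefix=""):
    """Flatten nested dicts into one level, joining keys with underscores."""
    flat = {}
    for key, value in obj.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=f"{name}_"))
        else:
            flat[name] = value
    return flat


# Hypothetical, trimmed officer record based on the sample in the question.
officer = {
    "name": "BARRICK, David James",
    "officer_role": "llp-designated-member",
    "address": {"locality": "London", "postal_code": "WC1N 3AX"},
}

# Illustrative column choice; "nationality" is absent from this record.
columns = ["name", "officer_role", "address_locality",
           "address_postal_code", "nationality"]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=columns, restval="",
                        extrasaction="ignore", lineterminator="\n")
writer.writeheader()
writer.writerow(flatten(officer))  # missing columns are filled with restval
print(buffer.getvalue(), end="")
# name,officer_role,address_locality,address_postal_code,nationality
# "BARRICK, David James",llp-designated-member,London,WC1N 3AX,
```

DictWriter plays the role of the get calls here: restval supplies the empty value for absent keys, and the name is quoted automatically because it contains the delimiter.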
Answered 2019-02-28T19:12:40.627