
When I export JSON objects from BigQuery, any field whose value is null disappears from the downloaded result.

Example export query:

EXPORT DATA OPTIONS(
  uri='gs://analytics-export/_*',
  format='JSON',
  overwrite=true) AS


SELECT NULL AS field1  

Actual result: {}

Expected result: {field1: null}

How can I force the export to include null values, as shown in my expected result?


1 Answer


For this, you can use:

SELECT TO_JSON_STRING(NULL) AS field1
SELECT 'null' AS field1
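
One caveat worth noting: `TO_JSON_STRING(NULL)` returns the string `'null'`, not a true JSON null, so the exported field holds a four-character string. A minimal Python sketch of how such an exported line parses:

```python
import json

# A field exported via TO_JSON_STRING(NULL) arrives as the string "null",
# not as a JSON null, so consumers see a str rather than None
line = '{"field1": "null"}'
parsed = json.loads(line)
print(type(parsed['field1']).__name__)  # str, not NoneType
```

If downstream consumers need a real null, one of the workarounds below is a better fit.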

The EXPORT DATA documentation does not mention any option for including null values in the output, so I suggest filing a feature request for it on the issue tracker. There are similar observations and not-yet-supported points for other products; see the details here.

There are several workarounds; let me show you two options below.

Option 1: query BigQuery directly from Python using the client library

from google.cloud import bigquery
import json

client = bigquery.Client()

query = "SELECT NULL AS field1, NULL AS field2"
query_job = client.query(query)

# Write one JSON object per row (newline-delimited, matching the export
# format); json.dumps serializes Python None as null, so the fields survive
with open('test.json', 'w') as file:
    for row in query_job:
        json_row = {'field1': row[0], 'field2': row[1]}
        file.write(json.dumps(json_row) + '\n')
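
The reason this works is that `json.dumps` keeps dictionary keys whose value is `None`, serializing them as JSON `null`:

```python
import json

# Python None survives serialization as a JSON null; the key is not dropped
print(json.dumps({'field1': None, 'field2': None}))
# {"field1": null, "field2": null}
```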

Option 2: use an Apache Beam pipeline on Dataflow with Python and BigQuery to produce the desired output

import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def add_null_field(row, field):
    # Re-insert `field` with a None value when it is missing from the row;
    # pass field='skip' to leave rows untouched
    if field != 'skip':
        row.update({field: row.get(field, None)})
    return row


def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output',
        dest='output',
        required=True,
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

    with beam.Pipeline(options=pipeline_options) as p:

        (p
         | beam.io.Read(beam.io.BigQuerySource(query='SELECT NULL AS field1, NULL AS field2'))
         | beam.Map(add_null_field, field='skip')
         | beam.Map(json.dumps)  # None values are serialized as JSON null
         | beam.io.Write(beam.io.WriteToText(known_args.output, file_name_suffix='.json')))


if __name__ == '__main__':
    run()

To run it:

python -m export --output gs://my_bucket_id/output/ \
                 --runner DataflowRunner \
                 --project my_project_id \
                 --region my_region \
                 --temp_location gs://my_bucket_id/tmp/

Note: just replace my_project_id, my_bucket_id and my_region with the appropriate values, then look in your Cloud Storage bucket for the output file.
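
To see what the `add_null_field` step does on its own, here is a quick local sketch (no Beam required) applying the same function to plain dicts:

```python
def add_null_field(row, field):
    # Same helper as in the pipeline: re-insert `field` as None if it is
    # missing from the row, or do nothing when field='skip'
    if field != 'skip':
        row.update({field: row.get(field, None)})
    return row

print(add_null_field({'field2': 'x'}, 'field1'))  # {'field2': 'x', 'field1': None}
print(add_null_field({'field1': 1}, 'skip'))      # {'field1': 1}
```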

Both options produce the output you are looking for:

{"field1": null, "field2": null}

Let me know whether this helps you reach the result you are after.

Answered 2021-10-22T10:04:59.233