When I export JSON objects from BigQuery, any field whose value is null disappears from the downloaded result.
Example export query:
EXPORT DATA OPTIONS(
uri='gs://analytics-export/_*',
format='JSON',
overwrite=true) AS
SELECT NULL AS field1
Actual result: {}
Expected result: {"field1": null}
How can I force the export to keep null values, as shown in the expected result?
For this, you can use:
Select TO_JSON_STRING(NULL) as field1
Select 'null' as field1
The EXPORT DATA documentation does not mention any option for including null values in the output, so I suggest opening a feature request on the issue-tracker page. There are similar observations and not-yet-supported cases for other products as well; see the details there.
There are many workarounds; let me show you two options below:
Option 1: query BigQuery directly from Python with the BigQuery client library
from google.cloud import bigquery
import json

client = bigquery.Client()

query = "select null as field1, null as field2"
query_job = client.query(query)

# Collect the row values into a dict; None values are kept.
json_row = {}
for row in query_job:
    json_row = {'field1': row[0], 'field2': row[1]}

# json.dumps writes None as a JSON null, so the fields survive.
with open('test.json', 'w+') as file:
    file.write(json.dumps(json_row))
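The reason this option works is that Python's json.dumps serializes None as a JSON null rather than dropping the key, so the exported fields are preserved. A quick standalone check:

```python
import json

# None values survive serialization as JSON null
print(json.dumps({"field1": None, "field2": None}))
# → {"field1": null, "field2": null}
```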
Option 2: use Apache Beam on Dataflow with Python and BigQuery to produce the desired output
import argparse
import json

import apache_beam as beam
from apache_beam.options.pipeline_options import PipelineOptions
from apache_beam.options.pipeline_options import SetupOptions


def add_null_field(row, field):
    # Ensure `field` is present in the row, defaulting to None (JSON null).
    if field != 'skip':
        row.update({field: row.get(field, None)})
    return row


def run(argv=None, save_main_session=True):
    parser = argparse.ArgumentParser()
    parser.add_argument(
        '--output',
        dest='output',
        required=True,
        help='Output file to write results to.')
    known_args, pipeline_args = parser.parse_known_args(argv)

    pipeline_options = PipelineOptions(pipeline_args)
    pipeline_options.view_as(SetupOptions).save_main_session = save_main_session

    with beam.Pipeline(options=pipeline_options) as p:
        (p
         | beam.io.Read(beam.io.BigQuerySource(query='SELECT null as field1, null as field2'))
         | beam.Map(add_null_field, field='skip')
         | beam.Map(json.dumps)
         | beam.io.Write(beam.io.WriteToText(known_args.output, file_name_suffix='.json')))


if __name__ == '__main__':
    run()
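The add_null_field helper in the pipeline above can also be exercised on its own, without Beam, to confirm it fills in a missing key with None while leaving existing values untouched (a minimal sketch using the same function as the snippet above):

```python
def add_null_field(row, field):
    # Same helper as in the pipeline: ensure `field` exists, defaulting to None
    if field != 'skip':
        row.update({field: row.get(field, None)})
    return row

print(add_null_field({'field1': 1}, 'field2'))  # {'field1': 1, 'field2': None}
print(add_null_field({'field1': 1}, 'skip'))   # {'field1': 1}
```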
To run it:
python -m export --output gs://my_bucket_id/output/ \
--runner DataflowRunner \
--project my_project_id \
--region my_region \
--temp_location gs://my_bucket_id/tmp/
Note: just replace my_project_id, my_bucket_id, and my_region with the appropriate values, then look in your Cloud Storage bucket for the output file.
Both options will produce the output you are looking for:
{"field1": null, "field2": null}
Please let me know whether this helps you reach the result you are looking for.