
I am trying to load a BigQuery external table through the bq command-line tool. The bq load command I executed:

bq load --source_format=NEWLINE_DELIMITED_JSON {provided dataset_name}.{provided bq external_table_name} gs://{provided bucket_name}/{provided folder_name}/{provided folder_name}/{provided folder_name}/20220107/*

The error I get is:

Error processing job '*:bqjob_r6bde3e8976b407bd_0000017e4295db78_1': bq_project_name:bq_dataset_name.bq_external_table_name is not allowed for this operation because it is currently EXTERNAL

Has anyone come across this error? I did not find any parameter in Google's bq load documentation that I need to pass to tell bq that this is an external table. Any insight on this would really help. I also tried loading the external table using GoogleCloudStorageToBigQueryOperator with external_table=True, but that throws an error as well, 'BigQuery job failed. Error was: {}'.format(err.content):

Exception: BigQuery job failed. Error was: b'{\n  "error": {\n    "code": 409,\n    "message": "Already Exists: Table project_name:dataset_name.Bq_Externaltable_name",\n    "errors": [\n      {\n        "message": "Already Exists: Table project_name:dataset_name.Bq_Externaltable_name",\n        "domain": "global",\n        "reason": "duplicate"\n      }\n    ],\n    "status": "ALREADY_EXISTS"\n  }\n}\n
[2022-01-09 17:10:20,995] {base_task_runner.py:113} INFO - Job 230862: Subtask {subtask_name} [2022-01-09 17:10:20,993] {taskinstance.py:1147} ERROR - BigQuery job failed. Error was: b'{\n  "error": {\n    "code": 409,\n    "message": "Already Exists: Table project_name:dataset_name.Bq_Externaltable_name",\n    "errors": [\n      {\n        "message": "Already Exists: Table project_name:dataset_name.Bq_Externaltable_name",\n        "domain": "global",\n        "reason": "duplicate"\n      }\n    ],\n    "status": "ALREADY_EXISTS"\n  }\n}\n'
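For context on the first error: an external table has no BigQuery-managed storage, so bq load (which writes rows into managed storage) refuses it as a destination. Queries against the external table read the files matched by source_uris directly, so new files only need to land under that GCS prefix to become queryable. If a load job is genuinely needed, the destination has to be a native table; a sketch, assuming a hypothetical native table {native_table_name} (SQL LOAD DATA shown in place of bq load, with the question's placeholders):

```sql
-- Assumption: {native_table_name} is a regular (non-external) table;
-- external tables cannot be the destination of a load job.
LOAD DATA INTO `{provided dataset_name}.{native_table_name}`
FROM FILES (
  format = 'NEWLINE_DELIMITED_JSON',
  uris = ['gs://{provided bucket_name}/{provided folder_name}/{provided folder_name}/{provided folder_name}/20220107/*']
);
```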
"
This error also threw me off, because I created the external table with Terraform using the code block below:
resource "google_bigquery_table" "external_table_name" {
  project             = local.project
  dataset_id          = google_bigquery_dataset.{provided_dataset_name}.dataset_id
  table_id            = local.{variable defined for Bq external table}
  schema              = file("${path.module}/../../../schema/{folder which holds schema json}/schemajsonforexternaltable.json")
  depends_on          = [google_bigquery_dataset.{provided_dataset_name}]
  deletion_protection = false

  external_data_configuration {
    autodetect    = false
    source_format = "NEWLINE_DELIMITED_JSON"
    source_uris = [
      "gs://{bucket_name}-${var.environment}/{folder_name}/{folder_name}/{folder_name}/*"
    ]
  }
}
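One way to confirm what Terraform actually created is to check the table type in INFORMATION_SCHEMA; an external table reports EXTERNAL, which is exactly the state the bq load error complains about. A sketch, using the same placeholders:

```sql
-- table_type is 'EXTERNAL' for external tables, 'BASE TABLE' for native ones
SELECT table_name, table_type
FROM `gcp_project_name.{dataset_name}.INFORMATION_SCHEMA.TABLES`
WHERE table_name = '{Bq_External_Table_name}';
```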
So why am I doing all this, and what is my end goal? I want to retrieve the file name, as in the query below; Google provides this to external tables as a pseudo column (_FILE_NAME):
SELECT
  p_num,
  _FILE_NAME AS file_loc /* use this column to know the file name used to build the row in the Bq External table*/
FROM
  `gcp_project_name.{dataset_name}.{Bq_External_Table_name}`;
If there is any alternative other than using a BQ external table to get the file name used to build each row, that is also fine; I can switch to that approach as well.
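If only the bare file name (rather than the full gs:// URI) is wanted, the _FILE_NAME pseudo column can also be trimmed in the same query; a sketch, assuming the same placeholder table and a hypothetical file_name alias:

```sql
SELECT
  p_num,
  _FILE_NAME AS file_loc,
  -- everything after the last '/' of the URI, i.e. the object's base name
  REGEXP_EXTRACT(_FILE_NAME, r'[^/]+$') AS file_name
FROM
  `gcp_project_name.{dataset_name}.{Bq_External_Table_name}`;
```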

@MikeKarp - There are two questions in my post above. The first: the bq load command to load the BQ external table failed, and my question from that attempt is whether bq load can be used to load a BQ external table at all. The second: loading the external table created through Terraform (which supplies the source URI path the external table needs) with GoogleCloudStorageToBigQueryOperator and external_table=True fails with "code": 409, "message": "Already Exists: Table. On the second one, I am not sure why GoogleCloudStorageToBigQueryOperator tries to create the table again when the external table has already been created in my GCP project through Terraform.


1 Answer


Ahhhh, I believe I follow now: you have an existing external table you can query, and you want to load it into a new static table, correct? The easiest way is to do this directly with SQL.

You can create a new table from a separate, existing external table with a CREATE statement in SQL:

CREATE TABLE `gcp_project_name.{dataset_name}.{new_standard_table_name}` AS
SELECT
  *,
  _FILE_NAME AS file_loc
FROM
  `gcp_project_name.{dataset_name}.{Bq_External_Table_name}`;

After updating the table names above, you can run this directly in the BigQuery SQL editor.

Per your question above, I have also kept the additional column in there, which is fine; you can add as many new derived columns as you like in the SELECT component, and they will flow into the new table.
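One caveat if this statement is rerun (for example from a scheduled Airflow task): plain CREATE TABLE fails on the second run with the same ALREADY_EXISTS error seen earlier. CREATE OR REPLACE TABLE makes the statement idempotent; a sketch with the same placeholders:

```sql
-- Rebuilds the static table from the external table on every run
CREATE OR REPLACE TABLE `gcp_project_name.{dataset_name}.{new_standard_table_name}` AS
SELECT
  *,
  _FILE_NAME AS file_loc
FROM
  `gcp_project_name.{dataset_name}.{Bq_External_Table_name}`;
```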

Answered 2022-01-11T20:01:49.737