json - AWS Glue crawler for a JSONB column in PostgreSQL RDS

Question

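I created a crawler that looks at a PostgreSQL 9.6 RDS table with a JSONB column, but the crawler identifies the column type as "string". When I then try to create a job that loads data from a JSON file on S3 into the RDS table, I get an error.

How can I map a JSON file source to a JSONB target column?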

score 2 · Accepted Answer

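It's not a direct copy, but one approach that has worked for me is to define the column on the target table as TEXT. After the Glue job populates the field, I convert it to JSONB. For example: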

alter table postgres_table
 alter column column_with_json set data type jsonb using column_with_json::jsonb;

Note the cast applied to the existing text data in the USING clause. Without it, the ALTER COLUMN will fail.
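Since the cast runs over every existing row, a single value that is not valid JSON makes the whole ALTER fail, so it can help to check the text values first. Below is a minimal sketch, assuming the (id, text) pairs have already been fetched from the table by whatever client you use. `find_invalid_json` and the sample rows are hypothetical, not part of Glue or PostgreSQL, and Python's parser differs from PostgreSQL's in some edge cases, so treat it as an approximate pre-check.

```python
import json


def _reject_constant(name):
    # PostgreSQL rejects NaN/Infinity, which Python's json module accepts by default.
    raise ValueError(f"non-standard JSON constant: {name}")


def find_invalid_json(rows):
    """Return (row_id, value) pairs whose text would not parse as JSON.

    rows: iterable of (row_id, text_value); None stands for SQL NULL.
    """
    invalid = []
    for row_id, value in rows:
        if value is None:
            continue  # NULL casts to NULL without error
        try:
            json.loads(value, parse_constant=_reject_constant)
        except ValueError:  # json.JSONDecodeError is a subclass of ValueError
            invalid.append((row_id, value))
    return invalid


# Hypothetical sample data standing in for rows read from the TEXT column.
sample_rows = [
    (1, '{"firstName": "Sergii", "age": 99}'),
    (2, None),
    (3, "{not json}"),
    (4, '{"score": NaN}'),
]
bad_rows = find_invalid_json(sample_rows)
print(bad_rows)
```

Any rows reported here would need to be fixed or nulled out before running the ALTER statement.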

score 1 · Accepted Answer

The crawler will identify the JSONB column type as "string", but you can try using the Unbox class in Glue to convert this column to JSON.

Let's look at the following table in PostgreSQL:

create table persons (id integer, person_data jsonb, creation_date timestamp )

Here is an example of a record from the persons table:

ID = 1
PERSON_DATA = {
               "firstName": "Sergii",
               "age": 99,
               "email":"Test@test.com"
               }
CREATION_DATE = 2021-04-15 00:18:06

The following code needs to be added to the Glue job:

# Assumes glueContext is the GlueContext already created at the top of the job script.
from awsglue.transforms import Unbox

# 1. Create a dynamic frame from the catalog.
df_persons = glueContext.create_dynamic_frame.from_catalog(database="testdb", table_name="persons", transformation_ctx="df_persons")
# 2. In path, give the name of the jsonb column that needs to be converted to JSON.
df_persons_json = Unbox.apply(frame=df_persons, path="person_data", format="json")
# 3. Convert the dynamic frame to a data frame.
datf_persons_json = df_persons_json.toDF()

# 4. After that you can process this column as JSON, or build a data frame with only the columns you need; each JSON element can be selected as a separate column:
final_df_person = datf_persons_json.select("id", "person_data.age", "person_data.firstName", "creation_date")
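To see what these steps do to a single record, here is a plain-Python sketch that runs without Glue or Spark. It only imitates the effect (parse the string field into a nested structure, then pick nested fields by dotted path) using the standard `json` module. `unbox_field` and `select_paths` are hypothetical helpers for illustration, not Glue APIs.

```python
import json


def unbox_field(record, path):
    """Return a copy of record with the JSON string at `path` parsed into a dict."""
    unboxed = dict(record)
    unboxed[path] = json.loads(record[path])
    return unboxed


def select_paths(record, paths):
    """Pick values by dotted path, naming each output column after the last path segment."""
    selected = {}
    for path in paths:
        value = record
        for key in path.split("."):
            value = value[key]
        selected[path.split(".")[-1]] = value
    return selected


# The sample row from the persons table, with person_data still a string
# (which is how the crawler types the jsonb column).
row = {
    "id": 1,
    "person_data": '{"firstName": "Sergii", "age": 99, "email": "Test@test.com"}',
    "creation_date": "2021-04-15 00:18:06",
}

unboxed_row = unbox_field(row, "person_data")
final_row = select_paths(
    unboxed_row, ["id", "person_data.age", "person_data.firstName", "creation_date"]
)
print(final_row)
```

The resulting record has the flat columns id, age, firstName and creation_date, mirroring the select in step 4.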

You can also check the following link:

https://docs.aws.amazon.com/glue/latest/dg/aws-glue-api-crawler-pyspark-transforms-Unbox.html
