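I have successfully ingested data from a MySQL RDS database into an S3 bucket using a Lake Formation blueprint.
After inspecting the data, roughly 41 of the 60 tables were ingested correctly.
Searching for errors revealed two things:
- My blueprint workflow did not ingest all tables because of this error in the blueprint/workflow:
An error occurred while calling o319.pyWriteDynamicFrame. Unknown type '245 in column 9 of 14 in binary-encoded result set.
- The missing tables are being created, but they contain no data. Judging from their JSON table properties, they were created by the initial crawl.
I have since learned that the error in point 1 is caused by JSON being used as a column type in the MySQL database (field type code 245 is MYSQL_TYPE_JSON in the MySQL protocol).
Has anyone run into this before? I have no experience editing the AWS JDBC driver on Glue, and the documentation is as poor as ever.
Am I missing an obvious workaround?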
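One workaround I am considering, sketched here under assumptions rather than as a confirmed fix: read the affected tables through a query that casts every JSON column to CHAR, so the JDBC driver only ever sees string columns. The helper name `build_select_with_json_cast` and the column name `payload` are hypothetical; the column types would have to come from somewhere like `information_schema.columns`, and the resulting query would have to be used in a custom job or a MySQL view, since I do not know of a way to inject it into the blueprint itself.

```python
from typing import Mapping


def build_select_with_json_cast(table: str, columns: Mapping[str, str]) -> str:
    """Build a SELECT that casts MySQL JSON columns to CHAR.

    `columns` maps column name -> MySQL data type (e.g. "int", "json").
    Identifiers are backtick-quoted; they are assumed to be trusted names
    taken from the database schema, not user input.
    """
    parts = []
    for name, mysql_type in columns.items():
        quoted = f"`{name}`"
        if mysql_type.lower() == "json":
            # Cast so the result set carries a string type instead of type 245.
            parts.append(f"CAST({quoted} AS CHAR) AS {quoted}")
        else:
            parts.append(quoted)
    return f"SELECT {', '.join(parts)} FROM `{table}`"


if __name__ == "__main__":
    # Hypothetical schema: "payload" stands in for whichever column is JSON.
    print(build_select_with_json_cast("bad_table", {"id": "int", "payload": "json"}))
```

If this works, the trade-off is that the JSON arrives in S3 as plain strings and would need parsing downstream.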
Below are the JSON table properties of a table that was ingested successfully (successful_table):
{
  "Name": "rds_DB_successful_table",
  "DatabaseName": "rds-ingestion",
  "CreateTime": "2020-06-23T14:07:04.000Z",
  "UpdateTime": "2020-06-23T14:07:20.000Z",
  "Retention": 0,
  "StorageDescriptor": {
    "Columns": [
      {
        "Name": "updated_at",
        "Type": "timestamp"
      },
      {
        "Name": "name",
        "Type": "string"
      },
      {
        "Name": "created_at",
        "Type": "timestamp"
      },
      {
        "Name": "id",
        "Type": "int"
      }
    ],
    "Location": "s3://XXX-data-lake/DB/rds_DB_successful_tableversion_0/",
    "InputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat",
    "OutputFormat": "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat",
    "Compressed": false,
    "NumberOfBuckets": 0,
    "SerdeInfo": {
      "SerializationLibrary": "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe",
      "Parameters": {
        "serialization.format": "1"
      }
    },
    "SortColumns": [],
    "StoredAsSubDirectories": false
  },
  "TableType": "EXTERNAL_TABLE",
  "Parameters": {
    "CreatedByJob": "RDSCONNECTOR_etl_4_b968999a",
    "CreatedByJobRun": "jr_37cc04c6fd928b9ff7a77fd50d6a98397a30c08ce3d56fae3fd618594585daea",
    "LastTransformCompletedOn": "2020-06-23 14:07:20.508091",
    "LastUpdatedByJob": "RDSCONNECTOR_etl_4_b968999a",
    "LastUpdatedByJobRun": "jr_37cc04c6fd928b9ff7a77fd50d6a98397a30c08ce3d56fae3fd618594585daea",
    "SourceConnection": "RDS Connection Type",
    "SourceTableName": "DB_successful_table",
    "SourceType": "JDBC",
    "TableVersion": "0",
    "TransformTime": "0:00:15.347357",
    "classification": "PARQUET"
  },
  "IsRegisteredWithLakeFormation": true
}
Below are the JSON table properties of a table that was created but not ingested successfully (bad_table):
{
  "Name": "_rds_DB_bad_table",
  "DatabaseName": "rds-ingestion",
  "Owner": "owner",
  "CreateTime": "2020-06-23T13:44:19.000Z",
  "UpdateTime": "2020-06-23T13:44:19.000Z",
  "LastAccessTime": "2020-06-23T13:44:19.000Z",
  "Retention": 0,
  "StorageDescriptor": {
    "Columns": [
      {
        "Name": "office_id",
        "Type": "int"
      },
      {
        "Name": "updated_at",
        "Type": "timestamp"
      },
      {
        "Name": "created_at",
        "Type": "timestamp"
      },
      {
        "Name": "id",
        "Type": "int"
      },
      {
        "Name": "position",
        "Type": "int"
      },
      {
        "Name": "id",
        "Type": "int"
      },
      {
        "Name": "deadline",
        "Type": "date"
      }
    ],
    "Location": "DB.bad_table",
    "Compressed": false,
    "NumberOfBuckets": -1,
    "SerdeInfo": {
      "Parameters": {}
    },
    "BucketColumns": [],
    "SortColumns": [],
    "Parameters": {
      "CrawlerSchemaDeserializerVersion": "1.0",
      "CrawlerSchemaSerializerVersion": "1.0",
      "UPDATED_BY_CRAWLER": "RDSCONNECTOR_discoverer_57904714",
      "classification": "mysql",
      "compressionType": "none",
      "connectionName": "RDS Connection Type",
      "typeOfData": "table"
    },
    "StoredAsSubDirectories": false
  },
  "PartitionKeys": [],
  "TableType": "EXTERNAL_TABLE",
  "Parameters": {
    "CrawlerSchemaDeserializerVersion": "1.0",
    "CrawlerSchemaSerializerVersion": "1.0",
    "UPDATED_BY_CRAWLER": "RDSCONNECTOR_discoverer_57904714",
    "classification": "mysql",
    "compressionType": "none",
    "connectionName": "RDS Connection Type",
    "typeOfData": "table"
  },
  "CreatedBy": "arn:aws:sts::724135113484:assumed-role/LakeFormationWorkflowRole/AWS-Crawler",
  "IsRegisteredWithLakeFormation": false
}
Perhaps comparing the JSON table properties of the successful and failed tables is the key.
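To make that comparison less error-prone than reading the two dumps side by side, here is a minimal sketch of a recursive diff over two such property dictionaries. The function name `diff_table_properties` and the `"<missing>"` marker are my own; the sample values are a small subset copied from the two tables above, and lists (such as `Columns`) are compared as whole values rather than element by element.

```python
MISSING = "<missing>"


def diff_table_properties(good: dict, bad: dict, prefix: str = "") -> list:
    """Return (dotted_path, good_value, bad_value) for every differing key."""
    diffs = []
    for key in sorted(set(good) | set(bad)):
        path = f"{prefix}{key}"
        if key not in good:
            diffs.append((path, MISSING, bad[key]))
        elif key not in bad:
            diffs.append((path, good[key], MISSING))
        elif isinstance(good[key], dict) and isinstance(bad[key], dict):
            # Descend into nested objects such as StorageDescriptor.
            diffs.extend(diff_table_properties(good[key], bad[key], path + "."))
        elif good[key] != bad[key]:
            diffs.append((path, good[key], bad[key]))
    return diffs


if __name__ == "__main__":
    good = {
        "StorageDescriptor": {
            "Location": "s3://XXX-data-lake/DB/rds_DB_successful_tableversion_0/",
            "NumberOfBuckets": 0,
        },
        "Parameters": {"classification": "PARQUET"},
        "IsRegisteredWithLakeFormation": True,
    }
    bad = {
        "StorageDescriptor": {"Location": "DB.bad_table", "NumberOfBuckets": -1},
        "Parameters": {"classification": "mysql"},
        "IsRegisteredWithLakeFormation": False,
    }
    for path, good_value, bad_value in diff_table_properties(good, bad):
        print(f"{path}: {good_value!r} != {bad_value!r}")
```

On the subset above, the differing paths are the Lake Formation registration flag, the classification (PARQUET versus mysql), and the storage location and bucket count, which is consistent with bad_table only ever having been touched by the crawler and never by the ETL job.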
Any help would be greatly appreciated!