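I'm developing an AWS Glue job that consumes partitioned data from S3 (Parquet files) and uses job bookmarks. I'm running into problems when trying to use the job bookmark feature for daily incremental loads. This is how I read the data: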
val push: String = "p_date > '" + start + "' and (attribute=='x' or attribute=='y')"
logger.info("Using pushdown predicate: " + push)
val source = glueContext
  .getCatalogSource(
    database = "testbase",
    tableName = "testtable",
    pushDownPredicate = push,
    transformationContext = "source")
  .getDynamicFrame()
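This is the Input-files.json generated by AWS Glue. It was created by the job bookmark logic on a run after the initial full load. No new data should be processed, and the empty "files" sections seem to reflect that correctly.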
[{
"path": "s3://path/to/bucket/attribute=x",
"files": []
}, {
"path": "s3://path/to/bucket/attribute=y",
"files": []
}]
However, instead of logging that the files are being skipped, the job logs the following:
After final job bookmarks filter, processing 0.00% of 0 files in partition DynamicFramePartition(com.amazonaws.services.glue.DynamicRecord@7d679e8a,s3://path/to/bucket/attribute=x,1578972694000).
After final job bookmarks filter, processing 0.00% of 0 files in partition DynamicFramePartition(com.amazonaws.services.glue.DynamicRecord@7d679e8a,s3://path/to/bucket/attribute=y,1578972694000).
My guess is that Glue then tries to create an empty DynamicFrame, which fails with the following message:
ERROR ApplicationMaster: User class threw exception: org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
org.apache.spark.sql.AnalysisException: Unable to infer schema for Parquet. It must be specified manually.;
at org.apache.spark.sql.wrapper.SparkSqlDecoratorDataSource$$anonfun$3.apply(SparkSqlDecoratorDataSource.scala:38)
at org.apache.spark.sql.wrapper.SparkSqlDecoratorDataSource$$anonfun$3.apply(SparkSqlDecoratorDataSource.scala:38)
at scala.Option.getOrElse(Option.scala:121)
at org.apache.spark.sql.wrapper.SparkSqlDecoratorDataSource.getOrInferFileFormatSchema(SparkSqlDecoratorDataSource.scala:37)
at org.apache.spark.sql.wrapper.SparkSqlDecoratorDataSource.resolveRelation(SparkSqlDecoratorDataSource.scala:53)
at com.amazonaws.services.glue.SparkSQLDataSource$$anonfun$getDynamicFrame$8.apply(DataSource.scala:640)
at com.amazonaws.services.glue.SparkSQLDataSource$$anonfun$getDynamicFrame$8.apply(DataSource.scala:604)
at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:63)
at com.amazonaws.services.glue.util.FileSchemeWrapper$$anonfun$executeWithQualifiedScheme$1.apply(FileSchemeWrapper.scala:63)
at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWith(FileSchemeWrapper.scala:57)
at com.amazonaws.services.glue.util.FileSchemeWrapper.executeWithQualifiedScheme(FileSchemeWrapper.scala:63)
at com.amazonaws.services.glue.SparkSQLDataSource.getDynamicFrame(DataSource.scala:603)
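Have you seen similar behavior with AWS Glue before? I'm considering implementing an "empty check" around the DynamicFrame that is about to be created, to keep the job from failing. Or is there an AWS-native solution that makes job bookmarks work properly in this case?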
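To make the "empty check" idea concrete, here is a minimal sketch of what I have in mind. It is only an assumption about how such a guard could look, not a confirmed Glue pattern: the object `EmptyReadGuard` and the method `readIfNotEmpty` are hypothetical names, and the guard matches on the message text from the stack trace above instead of on the exception class, so the helper itself needs nothing beyond the Scala standard library.

```scala
import scala.util.{Failure, Success, Try}

// Hypothetical helper, not part of AWS Glue or Spark.
object EmptyReadGuard {
  // Message fragment copied from the AnalysisException in the log above.
  private val SchemaInferenceMessage = "Unable to infer schema"

  /** Evaluates `read` once.
    * Returns Some(result) on success, None if it fails with the schema
    * inference error (interpreted here as "the bookmark filtered out every
    * file"), and rethrows any other exception unchanged.
    */
  def readIfNotEmpty[A](read: => A): Option[A] =
    Try(read) match {
      case Success(frame) => Some(frame)
      case Failure(e)
          if Option(e.getMessage).exists(_.contains(SchemaInferenceMessage)) =>
        None
      case Failure(e) => throw e
    }
}

// Intended usage inside the job (assumes glueContext and push from above):
//
// val maybeSource = EmptyReadGuard.readIfNotEmpty {
//   glueContext
//     .getCatalogSource(database = "testbase", tableName = "testtable",
//       pushDownPredicate = push, transformationContext = "source")
//     .getDynamicFrame()
// }
// maybeSource.foreach { source => /* transform and write */ }
```

This relies on the exception being thrown inside `getDynamicFrame()` itself, which is what the stack trace suggests. Matching on the message text is brittle, and the same error could also hide a genuinely broken Parquet path, so I would prefer a cleaner solution if one exists.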