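I am currently building a data lake in which I run AWS Glue jobs daily to copy data from our database and make it queryable through AWS Athena. Because the schema of the data I ingest changes frequently, I crawl it regularly with a Glue Crawler. Unfortunately, when I run the crawler on two consecutive days and the schema has changed in between, I get an error about incompatible schemas: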

HIVE_CANNOT_OPEN_SPLIT: Error opening Hive split s3://***/raw/itemstore/parquet_flattened/v1/type=articles/year=2019/month=12/day=12/part-00012-13fc8243-cd4e-47b8-8763-56b15ea46e84-c000.snappy.parquet (offset=0, length=32745292): Schema mismatch, metastore schema for row column item__timeline.element has 10 fields but parquet schema has 9 fields

This query ran against the "***" database, unless qualified by the query. Please post the error message on our forum or contact customer support with Query Id: ***

Here is the CloudFormation code for our crawler:

  ItemStoreCrawler:
    Type: AWS::Glue::Crawler
    Properties:
      Name: <A STRING>
      DatabaseName: !Ref DatabaseName
      Configuration: "{\"Version\": 1.0, \"CrawlerOutput\": {\"Partitions\": {\"AddOrUpdateBehavior\": \"InheritFromTable\"},\"Tables\":{\"AddOrUpdateBehavior\":\"MergeNewColumns\"}}}"
      Role: !GetAtt CrawlerRole.Arn
      TablePrefix: String 
      Tags:
        Platform: !Ref Platform
        Maintainer: !Ref Maintainer
        ServerType: !Ref ServerType
        ServiceName: !Sub ${ProjectName}
        Environment: !Ref Environment

      Targets:
        S3Targets:
          - Path: String

My guess is that the schema merge behavior of my crawler is set incorrectly in the line starting with Configuration, but I cannot find a fix.
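For readability, the escaped Configuration string in the template can be written out as a plain structure. The following is a minimal Python sketch, not part of the original template, that builds the same JSON with `json.dumps`. The variable names are made up for illustration; the keys and values are copied from the template.

```python
import json

# Same settings as the Configuration line in the template, as a dict.
# Variable names are illustrative; keys and values come from the template.
crawler_configuration = {
    "Version": 1.0,
    "CrawlerOutput": {
        # Partitions take their schema from the table definition.
        "Partitions": {"AddOrUpdateBehavior": "InheritFromTable"},
        # Newly detected columns are merged into the existing table.
        "Tables": {"AddOrUpdateBehavior": "MergeNewColumns"},
    },
}

# Serialize to the JSON string that the crawler's Configuration property expects.
configuration_json = json.dumps(crawler_configuration)
print(configuration_json)
```

Written this way, it is easier to see that the template sets two separate behaviors: one for partitions and one for tables.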

1 Answer

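This has to do with getting it to ignore column order. I strongly recommend not using the Glue Crawler. Instead, use Glue as the Hive Metastore and write the tables directly for Athena to avoid this.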

https://docs.aws.amazon.com/athena/latest/ug/handling-schema-updates-chapter.html#summary-of-updates
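One possible reading of this advice is to declare the table schema yourself instead of letting a crawler infer it. The sketch below assumes that reading and only builds an Athena `CREATE EXTERNAL TABLE` statement as a string; it does not call AWS. The helper `build_create_table_ddl`, the database name `datalake`, the table name `itemstore`, and the column `item__id` are all hypothetical. The partition keys and the S3 prefix are taken from the path in the error message, with the bucket left elided as in the question.

```python
def build_create_table_ddl(database, table, columns, partitions, location):
    """Return an Athena CREATE EXTERNAL TABLE statement for Parquet data.

    columns and partitions are lists of (name, type) pairs.
    """
    def render(cols):
        # Backtick-quote each column name, one column per line.
        return ",\n  ".join(f"`{name}` {dtype}" for name, dtype in cols)

    return (
        f"CREATE EXTERNAL TABLE IF NOT EXISTS {database}.{table} (\n"
        f"  {render(columns)}\n"
        f")\n"
        f"PARTITIONED BY (\n"
        f"  {render(partitions)}\n"
        f")\n"
        f"STORED AS PARQUET\n"
        f"LOCATION '{location}'"
    )


# Hypothetical example values; replace them with the real schema.
ddl = build_create_table_ddl(
    database="datalake",                      # assumed database name
    table="itemstore",                        # assumed table name
    columns=[("item__id", "string")],         # assumed column
    partitions=[("type", "string"), ("year", "string"),
                ("month", "string"), ("day", "string")],
    location="s3://***/raw/itemstore/parquet_flattened/v1/",
)
print(ddl)
```

With an explicit definition like this, the schema changes only when you edit the statement, and new partitions still have to be registered separately.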

Answered 2020-01-13T18:02:22.303