Suppose I have a CSV with (in the simple case) a header and a single row of data, in which some values are missing (null), like this:

name,surname,age
John,,32

The corresponding catalog table is defined like this:

MyDataTable:
  Type: AWS::Glue::Table
  DependsOn: CatalogDatabaseName
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: my_db
    TableInput:
      Name: my_data
      TableType: EXTERNAL_TABLE
      Parameters: {
        "skip.header.line.count": "1",
        "compressionType": "none",
        "classification": "csv",
        "columnsOrdered": "true",
        "areColumnsQuoted": "true",
        "delimiter": ",",
        "typeOfData": "file",
        "header": "true",
        "inferSchema": "false",
        "quote": "\"",
        "escape": "\""
      }
      StorageDescriptor:
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Columns:
          - Name: name
            Type: string
          - Name: surname
            Type: string
          - Name: age
            Type: int
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        Location: s3://somewhere/
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.serde2.OpenCSVSerde

If I try to read the data through the catalog (from Spark) like this:

glueContext.getCatalogSource(
      database = "my_db",
      tableName = "my_data")
   .getDynamicFrame()
   .printSchema()

I can see that the surname column has disappeared, because there is no data for that particular column:

root
|-- name: string
|-- age: int
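
A likely explanation, assuming the DynamicFrame schema is computed from the records themselves rather than copied from the catalog definition: a column that has no value in any record never shows up in the computed schema. The toy model below is plain Scala, not Glue code, and is only meant to illustrate that behaviour. The names SchemaInferenceSketch, toRecord, and inferred are hypothetical.

```scala
// Illustration only: a toy model of schema inference from data,
// NOT Glue's actual implementation.
object SchemaInferenceSketch {
  // Columns declared in the catalog table from the question.
  val declared: Seq[String] = Seq("name", "surname", "age")

  // Turn one CSV line into a column -> value map, omitting empty fields,
  // the way a self-describing record carries no entry for a missing value.
  def toRecord(header: Seq[String], line: String): Map[String, String] =
    header
      .zip(line.split(",", -1)) // limit -1 keeps empty fields
      .filter { case (_, value) => value.nonEmpty }
      .toMap

  // Schema inferred from data: columns that occur in at least one record,
  // listed in header order.
  def inferred(header: Seq[String], records: Seq[Map[String, String]]): Seq[String] =
    header.filter(column => records.exists(_.contains(column)))

  def main(args: Array[String]): Unit = {
    val header = "name,surname,age".split(",").toSeq
    val records = Seq("John,,32").map(toRecord(header, _))
    println(inferred(header, records).mkString(", ")) // name, age
    println(declared.mkString(", "))                  // name, surname, age
  }
}
```

Under this assumption, the surname column can only survive if the schema comes from the declaration (the catalog, or a schema supplied explicitly) instead of from the data.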

How can I prevent Glue/AWS from dropping that column and, more generally, any empty column?
