aws-glue - How can I stop an AWS Glue DynamicFrame from dropping empty columns when reading a CSV?

Question

Suppose I have a CSV with (in the simplest case) a header and a single data row, where some of the values are missing (null), like this:

name,surname,age
John,,32

The corresponding catalog table is defined like this:

MyDataTable:
  Type: AWS::Glue::Table
  DependsOn: CatalogDatabaseName
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: my_db
    TableInput:
      Name: my_data
      TableType: EXTERNAL_TABLE
      Parameters: {
        "skip.header.line.count": "1",
        "compressionType": "none",
        "classification": "csv",
        "columnsOrdered": "true",
        "areColumnsQuoted": "true",
        "delimiter": ",",
        "typeOfData": "file",
        "header": "true",
        "inferSchema": "false",
        "quote": "\"",
        "escape": "\""
      }
      StorageDescriptor:
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Columns:
          - Name: name
            Type: string
          - Name: surname
            Type: string
          - Name: age
            Type: int
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        Location: s3://somewhere/
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.serde2.OpenCSVSerde

If I try to read the data through the catalog (via Spark) like this:

glueContext.getCatalogSource(
      database = "my_db",
      tableName = "my_data")
   .getDynamicFrame()
   .printSchema()

I can see that the surname column has disappeared, because that particular column contains no data:

root
|-- name: string
|-- age: int

How can I prevent Glue/AWS from dropping that column, and empty columns in general?
