Suppose I have a CSV with (in the simple case) a header and a single row of data, in which some values are missing (null), like this:
name,surname,age
John,,32
The corresponding Glue Data Catalog table is defined like this:
MyDataTable:
  Type: AWS::Glue::Table
  DependsOn: CatalogDatabaseName
  Properties:
    CatalogId: !Ref AWS::AccountId
    DatabaseName: my_db
    TableInput:
      Name: my_data
      TableType: EXTERNAL_TABLE
      Parameters: {
        "skip.header.line.count": "1",
        "compressionType": "none",
        "classification": "csv",
        "columnsOrdered": "true",
        "areColumnsQuoted": "true",
        "delimiter": ",",
        "typeOfData": "file",
        "header": "true",
        "inferSchema": false,
        "quote": "\"",
        "escape": "\""
        }
      StorageDescriptor:
        OutputFormat: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
        Columns:
          - Name: name
            Type: string
          - Name: surname
            Type: string
          - Name: age
            Type: int
        InputFormat: org.apache.hadoop.mapred.TextInputFormat
        Location: s3://somewhere/
        SerdeInfo:
          SerializationLibrary: org.apache.hadoop.hive.serde2.OpenCSVSerde
If I try to read the data through the catalog (via Spark) like this:
glueContext.getCatalogSource(
    database = "my_db",
    tableName = "my_data")
  .getDynamicFrame()
  .printSchema()
I can see that the surname column has disappeared, because there is no data for that particular column:
root
|-- name: string
|-- age: int
How can I prevent Glue/AWS from dropping that column and, more generally, any empty column?