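I am trying to use the spark-xml library ( https://github.com/databricks/spark-xml ) through the Scala Spark API to read a large number of XML files from S3.
- The schema differs across the XML files in S3, so simply reading them all at once leaves some fields corrupted on read. I believe this is caused by schema inconsistencies between the files, although the schemas are largely the same.
- Because of this, I am explicitly specifying the tags I want to extract in a schema object.
- I have a question about the syntax for defining the schema when array types are involved.
- You can see the schema of the XML below. For the purposes of this question, I only want to extract the following:
  - _ProgramInfoID
  - _VALUE (contained in the array-type object Line)
Thanks for any feedback here!
The following code sample extracts only the _ProgramInfoID field: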
import org.apache.spark.sql.types._

val schema = StructType(
  // BROADCAST METADATA
  Array(StructField("BroadcastMetadata", StructType(
    // PROGRAM INFO
    Array(StructField("ProgramInfo", StructType(
      Array(StructField("_ProgramInfoID", StringType, nullable = true))
    )))
  )))
)
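For context, this is roughly how I hand a schema like this to spark-xml. It is only a minimal sketch: the `ReadProgramInfo` object name, the `rowTag` value, and the S3 path are placeholders I made up for illustration, not the values from my actual job.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.types._

// Hypothetical wrapper object, used only to keep the sketch self-contained.
object ReadProgramInfo {
  // Same explicit schema as above: only BroadcastMetadata.ProgramInfo._ProgramInfoID is kept.
  val schema: StructType = StructType(Array(
    StructField("BroadcastMetadata", StructType(Array(
      StructField("ProgramInfo", StructType(Array(
        StructField("_ProgramInfoID", StringType, nullable = true)
      )))
    )))
  ))

  // `path` and `rowTag` are assumptions supplied by the caller,
  // e.g. an s3a:// prefix and the name of the XML element that wraps one record.
  def readProgramInfoIds(spark: SparkSession, path: String, rowTag: String): DataFrame =
    spark.read
      .format("com.databricks.spark.xml")
      .option("rowTag", rowTag)
      .schema(schema) // skip schema inference and read only the declared fields
      .load(path)
      .select("BroadcastMetadata.ProgramInfo._ProgramInfoID")
}
```

Passing the schema explicitly means fields that are not declared in it are simply not read, which is why I want to get the array part of the schema right as well.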
The following attempt tries to read both _ProgramInfoID and _VALUE, but I get an error when trying to define the schema object:
val schema = StructType(
  // BROADCAST METADATA
  Array(StructField("BroadcastMetadata",StructType(
    // PROGRAM INFO
    Array(StructField("ProgramInfo", StructType(
      Array(StructField("_ProgramInfoID", StringType, nullable = true))
    )))
  ))),
  // LINES
  Array(StructField("Lines", StructType(
    // Line
    ArrayType(StructField("Line", StructType(
      Array(StructField("element", StructType(
        Array(StructField("_VALUE", StringType, nullable = true))
      ))))))
    )))
)
Error:
<console>:45: error: type mismatch;
found : org.apache.spark.sql.types.StructField
required: org.apache.spark.sql.types.DataType
ArrayType(StructField("Line", StructType(
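I realize this is a syntax error, but I have not been able to find good documentation on how to translate the schema shown below into one built from Spark types such as ArrayType, StructField, and StructType.
Related question involving array-type objects in XML: "Complex custom schema for xml processing in spark"
However, I was not able to solve this problem using the solution there.
Schema of the sample XML data: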
root
|-- BroadcastMetadata: struct (nullable = true)
| |-- ExtendedProgramInfo: struct (nullable = true)
| | |-- Schedule: struct (nullable = true)
| | | |-- AiringType: string (nullable = true)
| | | |-- PartNumber: long (nullable = true)
| | | |-- Program: struct (nullable = true)
| | | | |-- AdditionalProgramURL: string (nullable = true)
| | | | |-- AliasTitle: string (nullable = true)
| | | | |-- Delta: string (nullable = true)
| | | | |-- Descriptions: struct (nullable = true)
| | | | | |-- ProgramDescription: struct (nullable = true)
| | | | | | |-- Delta: string (nullable = true)
| | | | | | |-- _ProgramID: long (nullable = true)
| | | | | | |-- _RoviRemotePath: string (nullable = true)
| | | | |-- EpisodeNumber: string (nullable = true)
| | | | |-- EpisodeTitle: string (nullable = true)
| | | | |-- EventDate: string (nullable = true)
| | | | |-- Genres: struct (nullable = true)
| | | | | |-- ProgramGenre: array (nullable = true)
| | | | | | |-- element: struct (containsNull = true)
| | | | | | | |-- Delta: string (nullable = true)
| | | | | | | |-- Genre: string (nullable = true)
| | | | | | | |-- _RoviRemotePath: string (nullable = true)
| | | | |-- Grid2Title: string (nullable = true)
| | | | |-- GridTitle: string (nullable = true)
| | | | |-- ProgramOriginalCountry: struct (nullable = true)
| | | | | |-- Delta: string (nullable = true)
| | | | | |-- _RoviRemotePath: string (nullable = true)
| | | | |-- ProgramOriginalLanguage: struct (nullable = true)
| | | | | |-- Delta: string (nullable = true)
| | | | | |-- _RoviRemotePath: string (nullable = true)
| | | | |-- RecordDateTime: string (nullable = true)
| | | | |-- Syndicated: string (nullable = true)
| | | | |-- TVRatings: struct (nullable = true)
| | | | | |-- ProgramTVRating: struct (nullable = true)
| | | | | | |-- Delta: string (nullable = true)
| | | | | | |-- _RoviRemotePath: string (nullable = true)
| | | | |-- ThreeDLevel: string (nullable = true)
| | | | |-- TitleParentID: long (nullable = true)
| | | | |-- _RoviRemotePath: string (nullable = true)
| | | |-- ProgramID: long (nullable = true)
| | | |-- ProgramShowingType: string (nullable = true)
| | | |-- RecordDateTime: string (nullable = true)
| | | |-- _ScheduleID: long (nullable = true)
| |-- Market: struct (nullable = true)
| | |-- Country: string (nullable = true)
| | |-- _MarketName: string (nullable = true)
| |-- ProgramInfo: struct (nullable = true)
| | |-- CC: string (nullable = true)
| | |-- Category: string (nullable = true)
| | |-- _ProgramInfoID: long (nullable = true)
| |-- Station: struct (nullable = true)
| | |-- Active: long (nullable = true)
| | |-- _UniqueIdentifier: string (nullable = true)
| |-- TranscriptUrl: string (nullable = true)
| |-- ViewershipData: string (nullable = true)
|-- Lines: struct (nullable = true)
| |-- Line: array (nullable = true) --> SEE ARRAY TYPE HERE
| | |-- element: struct (containsNull = true)
| | | |-- _LineDateTime: timestamp (nullable = true)
| | | |-- _StationGUID: string (nullable = true)
| | | |-- _StationID: long (nullable = true)
| | | |-- _UTCDelta: long (nullable = true)
| | | |-- _UTCLineDateTime: string (nullable = true)
| | | |-- _VALUE: string (nullable = true)
|-- _BreakType: string (nullable = true)
|-- _Duration: double (nullable = true)
|-- _PageID: string (nullable = true)
|-- _StationGUID: string (nullable = true)
|-- _StationID: long (nullable = true)
I would appreciate any help here, thanks!