
We have a table backed by a huge dataset of complex nested Avro files generated with the Spark 2.1.x framework. The files are stored in S3 and queried through a Hive external table.

My company recently decided to upgrade our ETL solution to the Spark 2.4.x framework. These are my changes:

(1) I changed my library dependencies from:

...
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.1.0",
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.1.0",
libraryDependencies += "com.databricks" %% "spark-avro" % "3.2.0",
libraryDependencies += "org.apache.avro" % "avro" % "1.8.1",

to:

...
libraryDependencies += "org.apache.spark" %% "spark-core" % "2.4.8",
libraryDependencies += "org.apache.spark" %% "spark-sql" % "2.4.8",
libraryDependencies += "org.apache.spark" %% "spark-avro" % "2.4.8",

(2) I changed the Spark 2.1 code, which looked like this:

import spark.implicits._
val singersDF = spark.read.json("<path>/popstars_singleline.json")          
singersDF.write.format("com.databricks.spark.avro").save(s3_out_path+"/nested-spark21.avro")       

to this for Spark 2.4:

spark.conf.set("spark.sql.legacy.replaceDatabricksSparkAvro.enabled",true)
...
singersDF.write.format("avro").save(s3_out_path+"/nested-spark24.avro")
       

(3) This is the external table I created in Hive:

CREATE EXTERNAL TABLE test_popstars
ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' 
STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' 
OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat'
LOCATION 's3a://<avro-data-path>/'
TBLPROPERTIES ( 'avro.schema.url'='s3a://<avro-schema-path>/popstars_spark21.avsc');

(4) This is my test data:

{"popstars":[{"first_name":"Ariana","last_name":"Grande","favorite_color":"Blue","favorite_number":6,"address":[{"city":"Florida","street":"Queens Street"},{"city":"Los Angeles","street":"Hollywood"}]},{"first_name":"Shawn","last_name":"Mendes","favorite_color":"Red","favorite_number":8,"address":[{"city":"Toronto","street":"Kings Street"},{"city":"New York","street":"Manhattan"}]}]}

I can now generate an Avro file with the Spark 2.4.5 libraries and save it to S3 at the external table's location. But when I SELECT * from this Hive (2.4.6) table, I get the following exception:

Caused by: java.io.IOException: java.io.IOException: While processing file s3a://xxx.avro. Found topLevelRecord.xxx, expecting union 
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerChain.handleRecordReaderNextException(HiveIOExceptionHandlerChain.java:121)
at org.apache.hadoop.hive.io.HiveIOExceptionHandlerUtil.handleRecordReaderNextException(HiveIOExceptionHandlerUtil.java:109)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.doNext(HiveContextAwareRecordReader.java:378)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:79)
at org.apache.hadoop.hive.ql.io.HiveRecordReader.doNext(HiveRecordReader.java:33)
at org.apache.hadoop.hive.ql.io.HiveContextAwareRecordReader.next(HiveContextAwareRecordReader.java:118)

I explored further and found the following details:

  1. Files generated by Spark 2.4.x carry an extra "namespace":"topLevelRecord.xxx" tag in the top-level header and in the headers of the nested structures. These tags are not present in files generated by Spark < 2.4.

For example, Spark v2.1.0 generates the following Avro file schema (topLevelRecord only):

{"type":"record","name":"topLevelRecord","fields":[{"name":"popstars","type":[{"type":"array","items":[{"type":"record","name":"popstars","fields":[{"name":"address","type":[{"type":"array","items":[{"type":"record","name":"address","fields":[{"name":"city","type":["string","null"]},{"name":"street","type":["string","null"]}]},"null"]},"null"]},{"name":"favorite_color","type":["string","null"]},{"name":"favorite_number","type":["long","null"]},{"name":"first_name","type":["string","null"]},{"name":"last_name","type":["string","null"]}]},"null"]},"null"]}]}

Spark v2.4.5 generates the following Avro file schema (note the topLevelRecord and topLevelRecord.popstars tags):

{"type":"record","name":"topLevelRecord","fields":[{"name":"popstars","type":[{"type":"array","items":[{"type":"record","name":"popstars","namespace":"topLevelRecord","fields":[{"name":"address","type":[{"type":"array","items":[{"type":"record","name":"address","namespace":"topLevelRecord.popstars","fields":[{"name":"city","type":["string","null"]},{"name":"street","type":["string","null"]}]},"null"]},"null"]},{"name":"favorite_color","type":["string","null"]},{"name":"favorite_number","type":["long","null"]},{"name":"first_name","type":["string","null"]},{"name":"last_name","type":["string","null"]}]},"null"]},"null"]}]}
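To confirm that the namespace attributes are the only difference between the two writer schemas, here is a quick check in plain Scala over the two schema strings above, taken verbatim:

```scala
// Writer schema produced by Spark 2.1.0 (no namespaces on nested records).
val schema21 = """{"type":"record","name":"topLevelRecord","fields":[{"name":"popstars","type":[{"type":"array","items":[{"type":"record","name":"popstars","fields":[{"name":"address","type":[{"type":"array","items":[{"type":"record","name":"address","fields":[{"name":"city","type":["string","null"]},{"name":"street","type":["string","null"]}]},"null"]},"null"]},{"name":"favorite_color","type":["string","null"]},{"name":"favorite_number","type":["long","null"]},{"name":"first_name","type":["string","null"]},{"name":"last_name","type":["string","null"]}]},"null"]},"null"]}]}"""

// Writer schema produced by Spark 2.4.5 (namespaces added on nested records).
val schema24 = """{"type":"record","name":"topLevelRecord","fields":[{"name":"popstars","type":[{"type":"array","items":[{"type":"record","name":"popstars","namespace":"topLevelRecord","fields":[{"name":"address","type":[{"type":"array","items":[{"type":"record","name":"address","namespace":"topLevelRecord.popstars","fields":[{"name":"city","type":["string","null"]},{"name":"street","type":["string","null"]}]},"null"]},"null"]},{"name":"favorite_color","type":["string","null"]},{"name":"favorite_number","type":["long","null"]},{"name":"first_name","type":["string","null"]},{"name":"last_name","type":["string","null"]}]},"null"]},"null"]}]}"""

// Stripping every "namespace" attribute from the 2.4 schema reproduces
// the 2.1 schema exactly -- the namespaces are the only change.
val stripped = schema24.replaceAll("\"namespace\":\"[^\"]*\",", "")
println(stripped == schema21) // prints: true
```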

  2. If I change the table property to point at the new schema, I can read the files generated by Spark 2.4.x, but I can no longer read the older files generated by Spark 2.1.x.

  3. I tried to suppress the namespace tag as suggested here, but it only suppresses the namespace in the top-level header; the namespace tags at the deeper levels remain.

  4. I think this issue may have started surfacing after this change to the code. But there seems to be no suggestion on how to read the Avro files that have already been generated.
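For context on why the observations above produce the exception: Avro identifies a record type by its *full name*, i.e. namespace plus record name, and schema resolution compares full names. A minimal sketch of that rule (the `fullName` helper is hypothetical, written here only to illustrate; it is not part of the Avro API):

```scala
// Avro's full name for a record: "<namespace>.<name>",
// or just "<name>" when no namespace is set.
def fullName(namespace: Option[String], name: String): String =
  namespace.fold(name)(ns => s"$ns.$name")

// Spark 2.1.x wrote the nested record without a namespace...
val oldName = fullName(None, "popstars")                    // "popstars"
// ...while Spark 2.4.x adds the enclosing record as its namespace.
val newName = fullName(Some("topLevelRecord"), "popstars")  // "topLevelRecord.popstars"

// A reader resolving the new files against the old schema therefore meets
// an unexpected type name, surfacing as the "Found topLevelRecord...,
// expecting union" error above.
println(oldName == newName) // prints: false
```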

My question is: how can I change the code/schema/libraries in Spark 2.4.5 so that my Hive (2.3.6) can successfully select fields from all Avro files, both those created with the Spark 2.1.x framework and those created with Spark 2.4.x?

