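I am trying to create an external table stored as Avro using PySpark's HiveContext. The CREATE EXTERNAL TABLE query runs fine in Hive. However, the same query fails when run through HiveContext, with this error:
org.apache.hadoop.hive.serde2.SerDeException: Encountered exception determining schema. Returning signal schema to indicate problem: null
My Avro schema is as follows.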
{
"type" : "record",
"name" : "test_table",
"namespace" : "com.ent.dl.enh.test_table",
"fields" : [ {
"name" : "column1",
"type" : [ "null", "string" ] , "default": null
}, {
"name" : "column2",
"type" : [ "null", "string" ] , "default": null
}, {
"name" : "column3",
"type" : [ "null", "string" ] , "default": null
}, {
"name" : "column4",
"type" : [ "null", "string" ] , "default": null
} ]
}
My create table script is:
CREATE EXTERNAL TABLE test_table_enh ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 's3://Staging/test_table/enh' TBLPROPERTIES ('avro.schema.url'='s3://Staging/test_table/test_table.avsc')
I am running the code below with spark-submit:
from pyspark import SparkConf, SparkContext
from pyspark.sql import HiveContext
print("Start of program")
sc = SparkContext()
hive_context = HiveContext(sc)
hive_context.sql("CREATE EXTERNAL TABLE test_table_enh ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' LOCATION 's3://Staging/test_table/enh' TBLPROPERTIES ('avro.schema.url'='s3://Staging/test_table/test_table.avsc')")
print("end")
Spark version: 2.2.0, OpenJDK version: 1.8.0, Hive version: 2.3.0
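The exception is raised while the SerDe determines the schema, so one thing that could be tried to narrow this down is to take the avro.schema.url lookup out of the picture. The following is a minimal sketch, assuming the schema is small enough to inline: it builds the same DDL but passes the schema through the AvroSerDe table property avro.schema.literal instead of avro.schema.url. The helper name build_create_table_ddl is hypothetical, and whether this avoids the exception in this setup is untested.

```python
import json

# Schema copied from the post; in practice it could be read from test_table.avsc.
AVRO_SCHEMA = {
    "type": "record",
    "name": "test_table",
    "namespace": "com.ent.dl.enh.test_table",
    "fields": [
        {"name": name, "type": ["null", "string"], "default": None}
        for name in ("column1", "column2", "column3", "column4")
    ],
}


def build_create_table_ddl(table, location, schema):
    """Return a CREATE EXTERNAL TABLE statement with the Avro schema inlined.

    Hypothetical helper for illustration; not part of PySpark or Hive.
    """
    # Compact JSON keeps the literal on a single line.
    schema_json = json.dumps(schema, separators=(",", ":"))
    # The literal is placed inside a single-quoted SQL string, so refuse
    # characters that would need escaping rather than guess at escaping rules.
    if "'" in schema_json or "\\" in schema_json:
        raise ValueError("schema contains characters that need SQL escaping")
    return (
        "CREATE EXTERNAL TABLE {table} "
        "ROW FORMAT SERDE 'org.apache.hadoop.hive.serde2.avro.AvroSerDe' "
        "STORED AS "
        "INPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerInputFormat' "
        "OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.avro.AvroContainerOutputFormat' "
        "LOCATION '{location}' "
        "TBLPROPERTIES ('avro.schema.literal'='{schema}')"
    ).format(table=table, location=location, schema=schema_json)


DDL = build_create_table_ddl(
    "test_table_enh", "s3://Staging/test_table/enh", AVRO_SCHEMA
)
# Inside the spark-submit job from the post, this string would replace the
# original statement: hive_context.sql(DDL)
```

If the table is created successfully with the inlined schema, that would point at the schema URL lookup (for example, how the s3:// path is resolved from Spark) rather than at the schema itself.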