hadoop - Hive external table on Avro files returns only NULL for all columns

Question

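I am trying to create a Hive external table on Avro files that were written using spark-scala. I am using CDH 5.16, which has Hive 1.1 and Spark 1.6.

I created the Hive external table, and the statement ran successfully. But when I query the data, I get NULL for all columns. My problem is similar to this one.

After some research, I found that it might be a problem with the schema. But I could not find a schema file for these Avro files at that location.

I am quite new to the Avro file type. Can someone help me out here?

Below is my Spark code snippet that saves the file as Avro: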

df.write.mode(SaveMode.Overwrite).format("com.databricks.spark.avro").save("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")

Below is my Hive external table create statement:

create external table prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro';

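Below is the result I get when I query the data with select * from prod_order_avro (every column comes back as NULL).

At the same time, when I read these Avro files into a dataframe using spark-scala and print them, I get the correct result. Below is the Spark code I used to read this data: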

val df=hiveContext.read.format("com.databricks.spark.avro").option("header","true").load("hdfs:path/user/hive/warehouse/transform.db/prod_order_avro")

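My questions are:

When creating these Avro files, do I need to change my Spark code to create a schema file separately, or will the schema be embedded in the files? If it needs to be separate, how do I achieve that?

If not, how do I create the Hive table so that the schema is retrieved from the files automatically? I read that in recent versions Hive takes care of this by itself if the schema is present in the files.

Please help me out here.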

score 2 · Accepted Answer

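Resolved this. It was a schema issue: the schema was not embedded with the Avro files, so I had to extract it using avro-tools and pass it when creating the table. It is working now.

I followed the steps below:

Extract a small amount of data from the Avro files stored in HDFS into a file on the local system. Below is the command used for this: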

sudo hdfs dfs -cat /path/file.avro | head --bytes 10K > /path/temp.txt

Use the avro-tools getschema command to extract the schema from this data:

avro-tools getschema /path/temp.txt
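
Copy the resulting schema (it will be in JSON form) into a new file with an .avsc extension and upload it to HDFS.

When creating the Hive external table, add the following property to it: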

TBLPROPERTIES('avro.schema.url'='hdfs://path/schema.avsc')
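
Putting the last step together with the table definition from the question, the full statement might look like the sketch below. The column list and LOCATION are copied from the question, and hdfs://path/schema.avsc is the placeholder path from the step above, to be replaced with wherever the .avsc file was actually uploaded.

```sql
-- Sketch: external table whose Avro schema is read from an .avsc file on HDFS.
-- 'hdfs://path/schema.avsc' is a placeholder; use the real location of the uploaded schema.
CREATE EXTERNAL TABLE prod_order_avro
(ProductID string,
ProductName string,
categoryname string,
OrderDate string,
Freight string,
OrderID string,
ShipperID string,
Quantity string,
Sales string,
Discount string,
COS string,
GP string,
CategoryID string,
oh_Updated_time string,
od_Updated_time string
)
STORED AS AVRO
LOCATION '/user/hive/warehouse/transform.db/prod_order_avro'
TBLPROPERTIES ('avro.schema.url'='hdfs://path/schema.avsc');
```

If the table already exists, ALTER TABLE prod_order_avro SET TBLPROPERTIES ('avro.schema.url'='hdfs://path/schema.avsc'); should attach the same property without recreating the table.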
