I created a Hive external table, stored as textfile and partitioned by event_date.
How do we specify a particular CSV format when reading this Hive table from Spark?
Environment:
1. Spark 1.5.0 - CDH 5.5.1, using Scala version 2.10.4 (Java HotSpot(TM) 64-Bit Server VM, Java 1.7.0_67)
2. Hive 1.1, CDH 5.5.1
Scala script:
sqlContext.setConf("hive.exec.dynamic.partition", "true")
sqlContext.setConf("hive.exec.dynamic.partition.mode", "nonstrict")
val distData = sc.parallelize(Array((1, 1, 1), (2, 2, 2), (3, 3, 3))).toDF
val distData_1 = distData.withColumn("event_date", current_date())
distData_1: org.apache.spark.sql.DataFrame = [_1: int, _2: int, _3: int, event_date: date]
scala> distData_1.show
+---+---+---+----------+
| _1| _2| _3|event_date|
+---+---+---+----------+
|  1|  1|  1|2016-03-25|
|  2|  2|  2|2016-03-25|
|  3|  3|  3|2016-03-25|
+---+---+---+----------+
distData_1.write.mode("append").partitionBy("event_date").saveAsTable("part_table")
scala> sqlContext.sql("select * from part_table").show
+-----+----+----+----------+
|    a|   b|   c|event_date|
+-----+----+----+----------+
|1,1,1|null|null|2016-03-25|
|2,2,2|null|null|2016-03-25|
|3,3,3|null|null|2016-03-25|
+-----+----+----+----------+
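For what it's worth, one workaround I am experimenting with is to write through the table's own SerDe using insertInto instead of saveAsTable (just a sketch, not verified on this CDH build; insertInto matches columns by position, so the column order must line up with the table):

// Sketch: let the Hive table's SerDe (fields terminated by ',') handle the write.
// Assumes dynamic partitioning is enabled as above and the columns map to (a, b, c, event_date).
val toInsert = distData_1.toDF("a", "b", "c", "event_date")
toInsert.write.mode("append").insertInto("part_table")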
Hive table:
create external table part_table (a String, b int, c bigint)
partitioned by (event_date Date)
row format delimited fields terminated by ','
stored as textfile LOCATION "/user/hdfs/hive/part_table";
select * from part_table shows:
|part_table.a|part_table.b|part_table.c|part_table.event_date|
|1           |1           |1           |2016-03-25           |
|2           |2           |2           |2016-03-25           |
|3           |3           |3           |2016-03-25           |
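As a diagnostic, the table metadata that Spark's HiveContext sees can be dumped like this (sketch only; describe formatted is passed through to Hive):

// Check which SerDe / field delimiter Spark sees for the table.
sqlContext.sql("describe formatted part_table").collect().foreach(println)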
Looking at HDFS:
The path /user/hdfs/hive/part_table/event_date=2016-03-25 has 2 part files:
part-00000
part-00001
part-00000 content
1,1,1
part-00001 content
2,2,2
3,3,3
PS: If we store the table as ORC, it writes and reads the data as expected.
If "fields terminated by" is left at the default, Spark can read the data as expected, so I am guessing this is a bug.
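Until this is resolved, a fallback sketch I can use is to bypass the Hive SerDe and parse the comma-delimited part files by hand (path taken from the listing above; the case class name is only for illustration):

// Fallback: read the raw text files and split on the comma delimiter ourselves.
case class PartRow(a: String, b: Int, c: Long, event_date: String)
val raw = sc.textFile("/user/hdfs/hive/part_table/event_date=2016-03-25")
val parsed = raw.map(_.split(",")).map(f => PartRow(f(0), f(1).toInt, f(2).toLong, "2016-03-25")).toDF()
parsed.show()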