My Hive table has only one partition:
show partitions hive_test;
OK
pt=20130805000000
Time taken: 0.124 seconds
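(For context, the storage location registered for that partition can be inspected like this; a sketch assuming a standard metastore setup, and the exact output layout depends on the Hive version:)

DESCRIBE FORMATTED hive_test PARTITION (pt='20130805000000');
-- the "Location:" row in the output is the HDFS path Hive hands to the InputFormat for this partition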
But when I run a simple SQL query, the data files under the folder 20130805000000 are not read. Why doesn't Hive just use the files under 20130805000000?
The SQL:
SELECT buyer_id AS USER_ID from hive_test limit 1;
Here is the exception:
java.io.IOException: /group/myhive/test/hive/hive_test/pt=20130101000000/data doesn't exist!
at org.apache.hadoop.hdfs.DFSClient.listPathWithLocations(DFSClient.java:1045)
at org.apache.hadoop.hdfs.DistributedFileSystem.listLocatedStatus(DistributedFileSystem.java:352)
at org.apache.hadoop.fs.viewfs.ChRootedFileSystem.listLocatedStatus(ChRootedFileSystem.java:270)
at org.apache.hadoop.fs.viewfs.ViewFileSystem.listLocatedStatus(ViewFileSystem.java:851)
at org.apache.hadoop.hdfs.Yunti3FileSystem.listLocatedStatus(Yunti3FileSystem.java:349)
at org.apache.hadoop.mapred.SequenceFileInputFormat.listLocatedStatus(SequenceFileInputFormat.java:49)
at org.apache.hadoop.mapred.FileInputFormat.getSplits(FileInputFormat.java:242)
at org.apache.hadoop.hive.ql.io.HiveInputFormat.getSplits(HiveInputFormat.java:261)
at org.apache.hadoop.mapred.JobClient.writeOldSplits(JobClient.java:1238)
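(In case it is useful, the input paths Hive resolves for this query can also be surfaced from the query plan; a sketch, and the exact plan output depends on the Hive version:)

-- the "Path -> Alias" / "Path -> Partition" sections list the locations Hive will scan
EXPLAIN EXTENDED SELECT buyer_id AS USER_ID FROM hive_test LIMIT 1;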
My question is: why does Hive try to find the file "/group/myhive/test/hive/hive_test/pt=20130101000000/data" instead of "/group/myhive/test/hive/hive_test/pt=20130101000000/"?
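For reference, if it turns out the metastore simply holds a stale location for this partition, I understand it could be repointed; this is only a sketch I have not run, and the target path is just assumed from the directory layout above:

-- repoint the partition at the real directory (assumed path; a scheme prefix such as
-- hdfs:// or viewfs:// may be needed depending on the cluster's default filesystem)
ALTER TABLE hive_test PARTITION (pt='20130805000000')
  SET LOCATION '/group/myhive/test/hive/hive_test/pt=20130805000000';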