
I am new to Apache Hive. When working with external table partitions, if I add a new partition directly in HDFS, the new partition is not added after running MSCK REPAIR TABLE. Below is the code I tried:

-- create the external table

hive> create external table factory(name string, empid int, age int) partitioned by(region string)  
    > row format delimited fields terminated by ','; 

-- detailed table information
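(Presumably taken from the output of describe formatted, trimmed here to the relevant fields:)

hive> describe formatted factory;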

Location:  hdfs://localhost.localdomain:8020/user/hive/warehouse/factory     
Table Type:             EXTERNAL_TABLE           
Table Parameters:        
    EXTERNAL                TRUE                
    transient_lastDdlTime   1438579844  

-- create directories in HDFS to hold the data for table factory

[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'

-- table data

cat factory1.txt
emp1,500,40
emp2,501,45
emp3,502,50

cat factory2.txt
EMP10,200,25
EMP11,201,27
EMP12,202,30

-- copy from local to HDFS

[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory1.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory1'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory2.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory2'

-- alter the table to update the metastore

hive> alter table factory add partition(region='southregion') location '/user/hive/testing/testing1/factory2';
hive> alter table factory add partition(region='northregion') location '/user/hive/testing/testing1/factory1';            
hive> select * from factory;                                                                      
OK
emp1    500 40  northregion
emp2    501 45  northregion
emp3    502 50  northregion
EMP10   200 25  southregion
EMP11   201 27  southregion
EMP12   202 30  southregion
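(As a quick sanity check, not in the original post, show partitions lists what the metastore now knows; presumably it would print:)

hive> show partitions factory;
region=northregion
region=southregion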

Now I created a new file, factory3.txt, to add as a new partition for the table factory.

cat factory3.txt
user1,100,25
user2,101,27
user3,102,30

-- create the path and copy the table data

[cloudera@localhost ~]$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory3'
[cloudera@localhost ~]$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/testing/testing1/factory3'

Now I executed the following query to update the metastore with the newly added partition:

MSCK REPAIR TABLE factory;

But the table still does not show the new partition content from the factory3 file. Could someone tell me where I went wrong while adding the partition for the table factory?

However, if I run the alter command, then it does show the new partition data:

hive> alter table factory add partition(region='eastregion') location '/user/hive/testing/testing1/factory3';
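(Given the contents of factory3.txt, the follow-up select would presumably return:)

hive> select * from factory where region='eastregion';
user1   100 25  eastregion
user2   101 27  eastregion
user3   102 30  eastregion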

Could someone tell me why the MSCK REPAIR TABLE command does not work?


2 Answers


For MSCK to work, the naming convention /partition_name=partition_value/ must be used. For example, under the root directory of the table:

# hadoop fs -ls /user/hive/root_of_table/*
 /user/hive/root_of_table/day=20200101/data1.parq
 /user/hive/root_of_table/day=20200101/data2.parq
 /user/hive/root_of_table/day=20200102/data3.parq
 /user/hive/root_of_table/day=20200102/data4.parq

When you run msck repair table <tablename>, the partitions day=20200101 and day=20200102 will be added automatically.
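That is also why the factory3 directory in the question is silently skipped: its name does not match the region=<value> pattern, so MSCK has no way to map it to a partition. With a conforming layout, a repair run would report something like this (a sketch; the exact wording varies by Hive version):

hive> msck repair table factory;
Partitions not in metastore: factory:region=eastregion
Repair: Added partition to metastore factory:region=eastregion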

answered 2015-11-24T13:42:23.130

You have to put the data in a directory named "region=eastregion" inside the table's location directory:

$ hadoop fs -mkdir 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
$ hadoop fs -copyFromLocal '/home/cloudera/factory3.txt' 'hdfs://localhost.localdomain:8020/user/hive/warehouse/factory/region=eastregion'
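After that, running the repair should register the partition (a sketch, assuming the directory name matches the partition value exactly):

hive> msck repair table factory;
hive> show partitions factory;
region=eastregion
region=northregion
region=southregion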
answered 2015-09-04T11:42:28.727