hadoop - Using Hive to select data across a large range of partitions

Question

I'm having trouble using Hive to select data across a large range of partitions.

This is the HQL I want to run:

INSERT OVERWRITE TABLE summary_T partition(DateRange='20131222-20131228')
select col1, col2, col3 From RAW_TABLE 
where cdate between '20131222' and '20131228' 
and (trim(col1) IS NULL or trim(col1)='')
and length(col2)=12;

"cdate" is the partition column of the table RAW_TABLE.

However, the query hangs right after it gives me the job ID.

Once I change it to:

INSERT OVERWRITE TABLE summary_T partition(DateRange='20131222-20131228')
select col1, col2, col3 From RAW_TABLE 
where cdate between '20131222' and '20131225' 
and (trim(col1) IS NULL or trim(col1)='')
and length(col2)=12;

then it starts working.

Is there a solution that would let me run the first HQL?

Thanks for your help!

score 0 · Accepted Answer

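I ran into a similar problem and tried adding CLUSTER BY 'partition_column' at the end of my SELECT statement. After doing that, I was able to run my INSERT over a larger date range.

So, if you change your query to: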

INSERT OVERWRITE TABLE summary_T partition(DateRange='20131222-20131228')
select col1, col2, col3 From RAW_TABLE 
where cdate between '20131222' and '20131228' 
and (trim(col1) IS NULL or trim(col1)='')
and length(col2)=12
CLUSTER BY DateRange;

performance should improve.

For an explanation of how CLUSTER BY helps the query, see this manual page, which describes it in detail:

https://cwiki.apache.org/confluence/display/Hive/LanguageManual+SortBy
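
As a rough illustration of what that manual page describes: CLUSTER BY is a shortcut for DISTRIBUTE BY plus SORT BY on the same column, so rows with the same key are sent to the same reducer and sorted within it. The sketch below uses hypothetical names (some_table, key_col) that are not part of the question's schema; it only shows the equivalence and is not a tested fix for the original query.

```sql
-- Hypothetical table and column names, for illustration only.

-- Form 1: CLUSTER BY sends rows with the same key_col value to the
-- same reducer and sorts the rows within each reducer by key_col.
SELECT key_col, col1, col2
FROM some_table
CLUSTER BY key_col;

-- Form 2: the same behavior spelled out as two separate clauses.
SELECT key_col, col1, col2
FROM some_table
DISTRIBUTE BY key_col
SORT BY key_col;
```

Neither form gives a total ordering of the output; the sort applies per reducer only.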
