optimization - Optimizing a Hive query that uses a subquery

Question

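I am using HDP 2.6.2 and Hive.

I am updating a partitioned table from a large table, filtering on a single column, and the query performs poorly. I do not understand why. The insert statement below is an example: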

insert into partitioned_table partition(dt_month) select * from large_table where incremental_string_col > (select last_incremental_col from temp_tab)

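I assumed that the subquery in the where clause would be executed once and its result cached, or that the CBO would send the whole temp_tab table, which has essentially one row, to all nodes. It does not seem to work that way: the query does not behave the way it does when I put the string value in as a literal!

Can I explicitly declare that a table should be cached in Hive? Can I explicitly declare that a query only needs to be executed once and its result cached? What am I missing here?

I understand that a string column is not the ideal case for this, but there is nothing I can do about it.

Any help would be much appreciated!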

score 0 · Accepted Answer

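You can use a cross join (executed as a map join) with the single-row subquery and then filter the rows with the inequality condition: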

select l.*
  from large_table l
       cross join (single_row_subquery) s
 where l.incremental_string_col>s.last_incremental_col;
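
Applied to the insert from the question, this might look like the sketch below. The table and column names are taken from the question. The `set` lines are assumptions about the session (dynamic partition inserts enabled in nonstrict mode, automatic map-join conversion turned on) and may already be configured on your cluster. Only the columns of `large_table` are selected, so that the subquery column is not written to the target table. As in the original statement, this assumes `dt_month` is the last column of `large_table`.

```sql
-- Assumption: dynamic partition inserts are allowed without a static partition.
set hive.exec.dynamic.partition=true;
set hive.exec.dynamic.partition.mode=nonstrict;
-- Assumption: let Hive convert the join with the tiny table into a map join.
set hive.auto.convert.join=true;

insert into table partitioned_table partition (dt_month)
select l.*
  from large_table l
       -- temp_tab is expected to hold exactly one row
       cross join (select last_incremental_col from temp_tab) s
 where l.incremental_string_col > s.last_incremental_col;
```

If `temp_tab` ever held more than one row, the cross join would duplicate every matching row of `large_table`, so it is worth making sure the subquery really returns a single row.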

Alternatively, compute the subquery in a separate script and pass the result in as a hivevar variable, as shown here: https://stackoverflow.com/a/37821218/2700344
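
A minimal sketch of the variable approach, assuming the statement is saved in a hypothetical file `load_increment.hql` and that the value of `last_incremental_col` has already been fetched by a separate query. The variable name `last_val` is also an assumption chosen for illustration. Because the value is substituted into the text of the query before it is compiled, Hive sees it as a string literal:

```sql
-- Hypothetical invocation (the value is a placeholder, not real data):
--   hive --hivevar last_val='<value read from temp_tab>' -f load_increment.hql

-- Assumption: dynamic partition inserts need nonstrict mode in this session.
set hive.exec.dynamic.partition.mode=nonstrict;

insert into table partitioned_table partition (dt_month)
select *
  from large_table
 -- ${hivevar:last_val} is replaced textually; the quotes make it a string literal
 where incremental_string_col > '${hivevar:last_val}';
```

The cost of this variant is an extra step to read the value from `temp_tab`. In exchange, the main query contains no join or subquery at all.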
