join - Hive: a small query block joined to a big table, why can't it use map join?

Question

I have a question about Hive map join. I know that when a small table is joined to a big table, a map join is the better choice, but consider SQL like this:

select a.col1,
       a.col2,
       a.col3, 
       /* many more columns from table a, omitted */
       b.col4,
       b.col5,
       b.col6
  from a
 inner join b
    on (a.id = b.id)
 where b.date = '2018-02-10'
   and b.hour = '10';

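Notes:
Table b is a big table, rows: 10000W+ (100 million+)
Table a is a big table, rows: 10000W+ (100 million+)
With the predicates applied, table b returns only 1000 rows, so I expected this SQL to use a map join, but the execution plan shows the join happening on the reduce side...

Can anyone tell me why?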

score 0 · Accepted Answer

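I'm not a Hive expert, but sometimes the tool used as the SQL client (e.g. MySQL Workbench) has an implicit limit of 1000 rows in its settings. Try specifying a limit yourself and forcing it to a value higher than 1000.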

For example, see this image:

This is MySQL Workbench. Unless you specify a limit yourself, a limit is automatically added to your query.

score 0 · Accepted Answer

Try moving the where clause into a subquery:

select a.col1,
       a.col2,
       a.col3, 
       /* many more columns from table a, omitted */
       b.col4,
       b.col5,
       b.col6
  from a
 inner join (select * from b where b.date = '2018-02-10' and b.hour = '10' )b 
    on a.id = b.id
 ;

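Alternatively, filtering into an intermediate (temporary) table instead of using a subquery will work 100% of the time, but it is not as efficient.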
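A minimal sketch of the intermediate-table variant, reusing the tables and columns from the question. The table name b_filtered and the alias f are hypothetical, and `create temporary table` assumes a Hive version that supports temporary tables (0.14 or later); on older versions a regular table that is dropped afterwards would serve the same purpose.

```sql
-- Step 1: materialize only the filtered rows of b.
-- b_filtered is a hypothetical name used for illustration.
create temporary table b_filtered as
select id,
       col4,
       col5,
       col6
  from b
 where b.date = '2018-02-10'
   and b.hour = '10';

-- Step 2: join the big table a against the now physically small table.
select a.col1,
       a.col2,
       a.col3,
       f.col4,
       f.col5,
       f.col6
  from a
 inner join b_filtered f
    on a.id = f.id;
```

Because b_filtered is written out before the join runs, its on-disk size is known when the second query is planned. The extra write and read is the efficiency cost mentioned above.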

Also check these Hive configuration parameters:

set hive.auto.convert.join=true; --this enables map-join
set hive.mapjoin.smalltable.filesize=25000000; --size of table to fit in memory

If the small table does not exceed the size (in bytes) specified by the hive.mapjoin.smalltable.filesize parameter, the join will be converted to a map join.
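One way to check whether the conversion actually happened is to prefix the query with `explain` after applying the settings. This is only a sketch: the shortened column list is for brevity, and the exact plan text varies with the Hive version and execution engine.

```sql
set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=25000000;

-- Inspect the plan instead of running the query.
-- A converted join typically shows up as "Map Join Operator" in a map stage;
-- a join listed under a reduce operator tree means it was not converted.
explain
select a.col1,
       b.col4
  from a
 inner join (select * from b where b.date = '2018-02-10' and b.hour = '10') b
    on a.id = b.id;
```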
