
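When you perform a join in Hive and then filter the output with a WHERE clause, the Hive compiler tries to filter the data before joining the tables. This is called predicate pushdown (http://allabouthadoop.net/what-is-predicate-pushdown-in-hive/).

For example: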

SELECT * FROM a JOIN b ON a.some_id=b.some_other_id WHERE a.some_name=6

If predicate pushdown is enabled (hive.optimize.ppd), the rows of table a with some_name = 6 are filtered before the join is performed.
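
To see why moving the filter below the join is safe for this query, here is a minimal sketch that uses Python's built-in sqlite3 module as a stand-in for Hive (an assumption, as are the sample rows and the payload column). It compares the query as written with a hand-written version that filters a in a subquery first, which is roughly the rewrite that predicate pushdown performs for an inner join. It says nothing about Hive's actual execution plan.

```python
import sqlite3

# In-memory SQLite database standing in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (some_id INTEGER, some_name INTEGER);
    CREATE TABLE b (some_other_id INTEGER, payload TEXT);
    INSERT INTO a VALUES (1, 6), (2, 7), (3, 6);
    INSERT INTO b VALUES (1, 'x'), (2, 'y'), (4, 'z');
""")

# Filter written after the join, as in the question.
where_after_join = conn.execute("""
    SELECT a.some_id, a.some_name, b.payload
    FROM a JOIN b ON a.some_id = b.some_other_id
    WHERE a.some_name = 6
    ORDER BY a.some_id
""").fetchall()

# Same filter applied to a by hand before the join.
filter_before_join = conn.execute("""
    SELECT f.some_id, f.some_name, b.payload
    FROM (SELECT * FROM a WHERE some_name = 6) AS f
    JOIN b ON f.some_id = b.some_other_id
    ORDER BY f.some_id
""").fetchall()

print(where_after_join)    # [(1, 6, 'x')]
print(filter_before_join)  # [(1, 6, 'x')]
```

Both forms return the same rows; the pushed-down form simply feeds fewer rows of a into the join.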

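However, I recently learned that there is another way to filter the data before a table is joined with another table (https://vinaynotes.wordpress.com/2015/10/01/hive-tips-joins-occur-before-where-clause/).

You can put the condition in the ON clause to filter table a before the join.

For example: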

SELECT * FROM a JOIN b  ON a.some_id=b.some_other_id AND a.some_name=6

Do both of these get the predicate pushdown optimization?

Thanks

1 Answer

Both are valid, and with an INNER JOIN and PPD both work the same way. But the two methods behave differently with OUTER JOINs.

The ON condition is applied before the join.

WHERE is applied after the join.
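
As a rough illustration of that difference, the sketch below runs the question's predicate (a.some_name = 6) in both positions, first with an INNER JOIN and then with a LEFT JOIN. It uses Python's sqlite3 module in place of Hive, and the sample rows and the payload column are invented for the example.

```python
import sqlite3

# SQLite stands in for Hive here (assumption); the join semantics are standard SQL.
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (some_id INTEGER, some_name INTEGER);
    CREATE TABLE b (some_other_id INTEGER, payload TEXT);
    INSERT INTO a VALUES (1, 6), (2, 7), (3, 6);
    INSERT INTO b VALUES (1, 'x'), (2, 'y'), (4, 'z');
""")

def run(join_kind, filter_keyword):
    """Run the query with the filter attached via AND (in ON) or WHERE."""
    sql = f"""
        SELECT a.some_id, a.some_name, b.payload
        FROM a {join_kind} JOIN b ON a.some_id = b.some_other_id
        {filter_keyword} a.some_name = 6
        ORDER BY a.some_id
    """
    return conn.execute(sql).fetchall()

inner_where = run("INNER", "WHERE")
inner_on = run("INNER", "AND")
left_where = run("LEFT", "WHERE")
left_on = run("LEFT", "AND")

print(inner_where)  # [(1, 6, 'x')]
print(inner_on)     # [(1, 6, 'x')]
print(left_where)   # [(1, 6, 'x'), (3, 6, None)]
print(left_on)      # [(1, 6, 'x'), (2, 7, None), (3, 6, None)]
```

With the INNER JOIN the position of the filter does not matter. With the LEFT JOIN, the ON version only decides which rows of b match, so every row of a survives, while the WHERE version removes rows from the joined result.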

The optimizer decides whether predicate pushdown is applicable, and it may well apply. But with a LEFT JOIN, for example, a WHERE filter on the right table changes the result:

SELECT * FROM a 
             LEFT JOIN b ON a.some_id=b.some_other_id 
 WHERE b.some_name=6 --Right table filter

This WHERE filter rejects NULLs, so the LEFT JOIN is effectively transformed into an INNER JOIN: if b.some_name=6, it cannot be NULL.
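
A small check of this effect, sketched with Python's sqlite3 module as a substitute for Hive (an assumption; the rows are made up). The LEFT JOIN with a WHERE filter on b returns exactly what the INNER JOIN returns.

```python
import sqlite3

# SQLite used as a stand-in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (some_id INTEGER);
    CREATE TABLE b (some_other_id INTEGER, some_name INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 6), (2, 7);
""")

# LEFT JOIN followed by a WHERE filter on the right table.
left_with_where = conn.execute("""
    SELECT a.some_id, b.some_name
    FROM a LEFT JOIN b ON a.some_id = b.some_other_id
    WHERE b.some_name = 6
    ORDER BY a.some_id
""").fetchall()

# Plain INNER JOIN with the same filter.
inner_with_where = conn.execute("""
    SELECT a.some_id, b.some_name
    FROM a JOIN b ON a.some_id = b.some_other_id
    WHERE b.some_name = 6
    ORDER BY a.some_id
""").fetchall()

print(left_with_where)   # [(1, 6)]
print(inner_with_where)  # [(1, 6)]
```

Row 3 of a has no match in b, so b.some_name is NULL for it and the comparison with 6 is not true; the row is discarded just as an inner join would discard it.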

And PPD does not change this behavior.

You can still use a LEFT JOIN with a WHERE filter if you add an OR condition that allows NULLs from the right table:

SELECT * FROM a 
             LEFT JOIN b ON a.some_id=b.some_other_id 
 WHERE b.some_name=6 OR b.some_other_id IS NULL --allow not joined records
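
The sketch below runs this query on a few invented rows, again with Python's sqlite3 module standing in for Hive (an assumption). It is meant only to show which rows the OR condition lets through.

```python
import sqlite3

# SQLite used as a stand-in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (some_id INTEGER);
    CREATE TABLE b (some_other_id INTEGER, some_name INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 6), (2, 7);
""")

# WHERE filter on the right table plus an OR branch for unmatched rows.
rows = conn.execute("""
    SELECT a.some_id, b.some_name
    FROM a LEFT JOIN b ON a.some_id = b.some_other_id
    WHERE b.some_name = 6 OR b.some_other_id IS NULL
    ORDER BY a.some_id
""").fetchall()

print(rows)  # [(1, 6), (3, None)]
```

Row 3, which has no partner in b, is now kept. Row 2 does have a partner in b, but with some_name = 7, so neither branch of the OR is true and the row is still dropped. Whether that is the intended result depends on the query, which is one more reason this style is easy to get wrong.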

If you have multiple joins with many such filter conditions, logic like this makes the query hard to understand and error-prone.

A LEFT JOIN with the filter in ON does not need the additional OR condition, because it filters the right table before the join. This query works as expected and is easy to understand:

SELECT * FROM a 
             LEFT JOIN b ON a.some_id=b.some_other_id and b.some_name=6
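
One way to read "filters the right table before the join" is that the ON filter behaves like joining against a pre-filtered copy of b. The sketch below checks that reading on invented rows, with Python's sqlite3 module standing in for Hive (an assumption).

```python
import sqlite3

# SQLite used as a stand-in for Hive (assumption for illustration).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE a (some_id INTEGER);
    CREATE TABLE b (some_other_id INTEGER, some_name INTEGER);
    INSERT INTO a VALUES (1), (2), (3);
    INSERT INTO b VALUES (1, 6), (2, 7);
""")

# Filter on the right table placed in the ON clause.
on_filter = conn.execute("""
    SELECT a.some_id, b.some_name
    FROM a LEFT JOIN b
      ON a.some_id = b.some_other_id AND b.some_name = 6
    ORDER BY a.some_id
""").fetchall()

# Right table filtered in a subquery, then joined.
prefiltered = conn.execute("""
    SELECT a.some_id, f.some_name
    FROM a LEFT JOIN (SELECT * FROM b WHERE some_name = 6) AS f
      ON a.some_id = f.some_other_id
    ORDER BY a.some_id
""").fetchall()

print(on_filter)    # [(1, 6), (2, None), (3, None)]
print(prefiltered)  # [(1, 6), (2, None), (3, None)]
```

Every row of a is present in the output, and only the b rows with some_name = 6 contribute values.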

PPD still works for the ON filter. If table b is stored as ORC, PPD pushes the predicate down to the lowest possible level, the ORC reader, which uses the built-in ORC indexes to filter at three levels: files, stripes, and rows.

More on the same topic and some tests: https://stackoverflow.com/a/46843832/2700344

So, with or without PPD, prefer explicit ANSI join syntax with the join condition and, where possible, the filter in ON. This keeps the query as simple as possible and avoids unintentionally converting an outer join into an INNER JOIN.

Answered 2019-04-28T20:00:27.340