hadoop - HiveQL query performance tuning

Question

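As the number of JOINs in a Hive query increases, the query runs in multiple stages and takes a long time to execute. How can I improve query performance? Are there any parameters that need to be set?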

score 4 · Accepted Answer

First of all, large tables should be placed last in the join order, because Hive streams the last table through the reducers and buffers the others:

SELECT small.*, large.* FROM small
JOIN large ON small.joinkey=large.joinkey;

Alternatively, you can use a hint to tell the optimizer which table is the biggest:

SELECT /*+ STREAMTABLE(large) */ small.*, large.* FROM large
JOIN small ON small.joinkey=large.joinkey;

Second, small tables can be cached in memory during the join by using a map-side join:

set hive.auto.convert.join = true;
SELECT a.*, b.* FROM a
JOIN b ON a.joinkey=b.joinkey;

The size threshold (in bytes) below which a table counts as small enough for a map join is set by:

set hive.mapjoin.smalltable.filesize = 1000000;
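To check whether a join was actually converted, one option is to inspect the query plan. The following is a minimal sketch that reuses the placeholder tables a and b from above; the exact plan text depends on the Hive version and execution engine, so treat the "Map Join Operator" wording as an assumption about typical output rather than a guarantee.

```sql
-- Enable automatic conversion and set the small-table threshold (bytes).
SET hive.auto.convert.join = true;
SET hive.mapjoin.smalltable.filesize = 1000000;

-- Print the plan without running the query.
-- If the conversion applied, the plan usually contains a "Map Join Operator"
-- instead of a reduce-side "Join Operator".
EXPLAIN
SELECT a.*, b.*
FROM a
JOIN b ON a.joinkey = b.joinkey;
```

If the plan still shows a reduce-side join, the smaller table is probably larger on disk than the configured threshold.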

I hope it helps a bit. GL!

score 0 · Accepted Answer

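In addition to the above, whenever the SELECT or WHERE clause of the query does not reference the right-hand table, it is best to use a LEFT SEMI JOIN.

A semi-join is more efficient than the more general inner join for the following reason: for a given record in the left-hand table, Hive can stop looking for matching records in the right-hand table as soon as any match is found. At that point, the selected columns from the left-hand table record can be projected.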
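As a rough illustration, here is the same kind of query written as a semi-join. The tables a and b and the column joinkey are the placeholder names used in the accepted answer, not real objects.

```sql
-- Return rows of a that have at least one match in b.
-- Only columns of a may appear in SELECT and WHERE;
-- b may be referenced only inside the ON clause.
SELECT a.*
FROM a
LEFT SEMI JOIN b ON a.joinkey = b.joinkey;
```

Unlike an inner join, this returns each row of a at most once even when b holds several matching rows, so it behaves like an IN or EXISTS subquery.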

score 0 · Accepted Answer

set hive.exec.parallel = true;

This one is generic. With the appropriate set commands, queries can be optimized further based on your cluster configuration.
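A sketch of how this might look in a session is shown below. The value 8 for hive.exec.parallel.thread.number is only an illustrative choice, not a recommendation; a suitable number depends on how much spare capacity the cluster has.

```sql
-- Let independent stages of a multi-stage query run at the same time.
SET hive.exec.parallel = true;

-- Upper bound on the number of stages executed concurrently
-- (illustrative value; tune to the cluster).
SET hive.exec.parallel.thread.number = 8;
```

This helps only when the plan contains stages that do not depend on each other, for example separate joins or subqueries that are combined later.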
