hadoop - Slow Hive query, why (LEFT SEMI JOIN)?

Question

My Hive query hangs and I don't know why (using Hadoop 0.20.1, Hive 0.9).

The query:

SELECT 
   a.field1 FROM table_1 a 
LEFT SEMI JOIN 
   (SELECT DISTINCT(usrId) FROM table_2 b 
       WHERE something=true ORDER BY rand() limit 1000) random_user_ids 
WHERE a.usrId=random_user_ids.usrId

EXPLAIN gives me back:

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-2 depends on stages: Stage-1
  Stage-3 depends on stages: Stage-2
  Stage-0 is a root stage

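Dataset:

About 200 million entries in the table.
The ratio of table_1.usrId values to table_1 rows should be roughly 1:40, i.e. the query above should return about 1000*40 = 40000 rows.

Observations:

The job hangs at 33% of the last stage, Stage-3 (reduce > sort), which is where the join happens (the join result is about 40000 rows).
Apart from the slowness, why is reduce > sort part of Stage-3 at all? It should only join things, not order anything.
The number of reducers is only 1 (because of the sort?), which is almost always bad because it does not scale.

If you need more input (e.g. more detailed EXPLAIN output, more cluster information), please let me know.

Thanks!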

score 2 · Accepted Answer

The JOIN condition should go in the ON clause, not in the WHERE clause.

Syntax example:

SELECT a.key, a.val
FROM a LEFT SEMI JOIN b ON (a.key = b.key)
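
Applied to the query from the question, this could look like the sketch below. It is untested against the asker's data and assumes the table and column names from the question (the filter column is the question's own placeholder). In Hive, the right-hand side of a LEFT SEMI JOIN may only be referenced in the ON clause, so the condition on random_user_ids.usrId moves out of WHERE.

```sql
-- Sketch only: same query as in the question, with the join
-- condition moved from the WHERE clause into the ON clause.
SELECT
   a.field1
FROM table_1 a
LEFT SEMI JOIN
   (SELECT DISTINCT(usrId) FROM table_2 b
       WHERE something=true ORDER BY rand() limit 1000) random_user_ids
ON (a.usrId = random_user_ids.usrId)
```

Note that ORDER BY in Hive produces a total order by running through a single reducer, so the random-sampling subquery will probably still run a one-reducer sort stage even after this change.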
