sql - "OR is currently not supported in JOIN" error in Hive

Question

I am running the query shown below in Hive. It has OR conditions in a left join, and when I run the select it throws the following error messages:

1. OR is currently not supported in JOIN (I understand that Hive only supports equi joins here)
2. Both left and right aliases encountered in JOIN 'cre_timestamp'

    select a.line_id,
           a.seller,
           a.sellerid,
           a.sellername,
           a.item_no,
           a.item_cd,
           a.cre_timestamp
     from Table A
     left join Table B
     on translate(a.id,'0','') = translate(b.id,'0','')
     or translate(a.seller,'Z','') = translate(b.seller,'Z','')
     or (a.item_no=b.item_no and a.item_no is not null and a.item_cd is not null and a.item_no <> '' and a.item_cd <> '')
     left join ( select id, line_id,cre_timestamp from table x) C
     on a.id=c.id
     and a.cre_timestamp < c.cre_timestamp
     and a.cre_timestamp > date_sub(c.cre_timestamp,21)
     and translate(a.id,'0','') = translate(b.id,'0','') or a.item_cd = b.item_cd
    where a.seller is null

How can we overcome this issue?

#For 1: One approach I could try is to repeat the query three times, once per OR condition, and combine the copies with UNION.

#For 2:

If I cut

and a.cre_timestamp < c.cre_timestamp
     and a.cre_timestamp > date_sub(c.cre_timestamp,21)

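and put it in the where clause at the bottom, it works fine. (I would like to understand why it does not work inside the join.)

Overall, I am looking for a better approach that does not hurt run time and gives a more optimized query. If I change it to use UNION, the same query has to be processed three times, which affects performance.

Thanks for taking the time to look into this.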

score 0 · Accepted Answer

I have already tried to explain in another post why non-equi (theta) joins do not work in the map-reduce framework, so I will not repeat it here. Please read: Why Hive can not support non-equi join

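Now, what happens if you move the non-equality join conditions to the where clause: the join works using the equality conditions only and may produce some duplication, because it can be a many-to-many join. Those duplicates are then filtered out by the WHERE condition. In the worst case, if you have no equality condition at all, a CROSS JOIN is performed, which is also easy to implement in the MapReduce framework, and you can then filter the rows in the where clause. Filtering is easy to implement as well.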
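As a rough illustration of the paragraph above, the join to C from the question could be written with only the equality condition in ON and the range conditions moved to WHERE. This is a minimal sketch, not a tested rewrite: `table_a` and `table_x` are placeholder names standing in for the asker's tables, and the `c.id is null` branch is an assumption added here so that rows of A with no match on id survive the filter.

```sql
-- Sketch only: table_a and table_x are placeholder table names.
select a.line_id,
       a.cre_timestamp,
       c.cre_timestamp as c_cre_timestamp
from table_a a
left join (select id, line_id, cre_timestamp from table_x) c
  on a.id = c.id                          -- equality only, so Hive can distribute rows by this key
where c.id is null                        -- keep rows of a that found no match on id
   or (    a.cre_timestamp < c.cre_timestamp
       and a.cre_timestamp > date_sub(c.cre_timestamp, 21));
```

Note that this is not strictly equivalent to a left join with the range conditions in ON: a row of A that matches on id but has no match inside the 21-day window is dropped here, whereas the ON version would keep it with NULLs in the C columns.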

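This is currently the only way to implement a theta join in Hive: a join on a partial equality condition that produces duplicates (or even a CROSS JOIN), followed by filtering. This approach can have a significant negative impact on performance. However, if one of the tables is small enough to fit in memory, you can offset that impact with a map join: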

set hive.auto.convert.join=true;
set hive.mapjoin.smalltable.filesize=1000000000; --small table size which can fit in memory, 
                                                 --adjust it and check Map Join operator in the plan
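To check whether the map join is actually chosen, one option is to run EXPLAIN on the query and look for a Map Join Operator in the plan. The sketch below assumes hypothetical tables `table_a` (large) and `small_b` (small enough to fit in memory), and uses a cross join plus a WHERE filter as the theta-join workaround described above. Some Hive configurations reject Cartesian products in strict mode, so this may need adjusting in your environment.

```sql
-- Sketch only: table_a and small_b are placeholder table names.
set hive.auto.convert.join=true;

explain
select a.line_id,
       b.id as b_id
from table_a a
cross join small_b b                      -- no equality key: every pair of rows is produced
where translate(a.id, '0', '') = translate(b.id, '0', '')
   or translate(a.seller, 'Z', '') = translate(b.seller, 'Z', '');  -- the OR is legal in WHERE
```

Because the filter runs after the join, this behaves like an inner join. Unmatched rows of `table_a` are not preserved the way they would be in the original left join.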

Also (this is unrelated to the question), your query contains redundant conditions:

(a.item_no=b.item_no and a.item_no is not null and a.item_cd is not null and a.item_no <> '' and a.item_cd <> '')

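a.item_no is not null is useless here because 1) this column is already used in the equality join condition, and NULLs are never joined, and 2) there is another condition, a.item_no <> '', which also excludes NULLs: if a value is not equal to the empty string it cannot be NULL, because NULL is neither equal nor not equal to anything.

a.item_cd is not null is redundant for the same reason: you already have a.item_cd <> '', which does not let NULLs through.

So the whole condition can be simplified to: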

(a.item_no=b.item_no and a.item_no <> '' and a.item_cd <> '')
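The NULL behaviour that makes the `is not null` checks redundant can be inspected with a small query like the one below. It is a sketch that relies on standard SQL three-valued logic, which Hive follows for these operators: a comparison with NULL evaluates to NULL rather than true, so a row with a NULL value passes neither `=` nor `<>`.

```sql
-- Each comparison below evaluates to NULL, not true, so none of them would let a row through
-- a WHERE or ON condition.
select null <> ''   as null_not_equal_empty,
       null =  ''   as null_equal_empty,
       null =  null as null_equal_null;
```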

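Yes, splitting the query into two or more queries combined with UNION is a common way to work around OR join conditions. If you have some common filters, you can use WITH subqueries to compensate for scanning the whole table several times. Splitting the dataset using different filters and join conditions, then combining with UNION or UNION ALL, can also help with skewed join keys. If you are using Tez, WITH subqueries allow the table to be read once (on the mappers); all other vertices read the same result prepared by the mapper, which avoids writing intermediate results to persistent storage each time.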
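The split described above might look roughly like this. It is a sketch under stated assumptions: `table_a` and `table_b` are placeholder names, the date filter in the WITH subquery is a hypothetical common filter, and inner joins are used for brevity. UNION (rather than UNION ALL) removes pairs that satisfy more than one of the three conditions, at the cost of also collapsing genuinely identical output rows.

```sql
-- Sketch only: table names and the cre_timestamp filter are illustrative assumptions.
with a_filtered as (
  select line_id, id, seller, item_no, item_cd, cre_timestamp
  from table_a
  where cre_timestamp >= '2020-01-01'     -- hypothetical filter shared by all branches
)
select a.line_id, b.id as b_id
from a_filtered a
join table_b b
  on translate(a.id, '0', '') = translate(b.id, '0', '')
union
select a.line_id, b.id as b_id
from a_filtered a
join table_b b
  on translate(a.seller, 'Z', '') = translate(b.seller, 'Z', '')
union
select a.line_id, b.id as b_id
from a_filtered a
join table_b b
  on a.item_no = b.item_no
where a.item_no <> ''
  and a.item_cd <> '';
```

To reproduce the left-join semantics of the original query, rows of `a_filtered` that matched in none of the branches would still have to be added back, for example with one more branch that selects them with NULL in place of the B columns.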
