sql - Hive / SQL - Left join with fallback

Question

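In Apache Hive I have two tables that I would like to left join, keeping all the data from the left table and adding data from the right table where possible. The join uses two conditions, because it is based on two fields (material_id and location_id). This works fine with a conventional left join: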

SELECT 
   a.*, 
   b.*
FROM a
LEFT JOIN (some more complex select) b
   ON a.material_id=b.material_id 
   AND a.location_id=b.location_id;

For location_id the database contains only two distinct values, say 1 and 2.

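We now have a requirement for the case where there is no "perfect match", meaning that material_id can be joined but the b table has no row with the right combination of material_id and location_id (e.g. material_id=100 and location_id=1). In that case the join should "default" or "fall back" to the other possible value of location_id, e.g. material_id=100 and location_id=2, and vice versa. This fallback should apply only to location_id.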

We have already looked into all kinds of possible answers, including ones using CASE etc., but to no avail. A setup like

...
ON a.material_id=b.material_id AND a.location_id=
CASE WHEN a.location_id = b.location_id THEN b.location_id ELSE ...;

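is something we tried, but we could not figure out how to really do this in the Hive query language.

Thank you for your help! Maybe somebody has a clever idea.

Here is some sample data: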

Table a
| material_id | location_id | other_column_a |
| 100         | 1           | 45            |
| 101         | 1           | 45            |
| 103         | 1           | 45            |
| 103         | 2           | 45            |



Table b
| material_id | location_id | other_column_b |
| 100         | 1           | 66            |
| 102         | 1           | 76            |
| 103         | 2           | 88            |


Left-join table (desired result)
| material_id | location_id | other_column_a | other_column_b                      |
| 100         | 1           | 45             | 66                                  |
| 101         | 1           | 45             | NULL (material not in b)            |
| 103         | 1           | 45             | 88 (defaults to the location_id=2 row) |
| 103         | 2           | 45             | 88                                  |

PS: As described here, EXISTS etc. does not work in a subquery inside the ON clause.

score 0 · Accepted Answer

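Maybe this helps somebody in the future:

We also came up with a different approach.

First, we create another table that calculates the average of the table b values per material_id across all (!) locations.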

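Second, in the joined table we create three columns:

c1 - the value where both material_id and location_id match (the result of left joining table a with table b). This column is NULL if there is no perfect match.

c2 - the value taken from the average (fallback) table for that material_id, regardless of location.

c3 - the "actual value" column: a CASE statement checks whether c1 is NULL (no perfect match on material and location) and, if so, uses the value from c2 (the material's average over the other locations) for further calculations.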
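The three columns described above could be sketched roughly as follows. This is only an illustration under assumptions: the answer builds a separate average table, whereas this sketch uses a CTE for brevity; the names b_avg and avg_b are hypothetical, and other_column_b is taken from the sample data in the question as the value being averaged.

```sql
-- Sketch only: b_avg / avg_b are hypothetical names, and a CTE stands in
-- for the separate average table mentioned in the answer.
WITH b_avg AS (
    -- average of the b value per material, across all locations
    SELECT material_id,
           AVG(other_column_b) AS avg_b
    FROM b
    GROUP BY material_id
)
SELECT
    a.material_id,
    a.location_id,
    a.other_column_a,
    b.other_column_b AS c1,          -- perfect match, NULL otherwise
    b_avg.avg_b      AS c2,          -- per-material fallback value
    CASE WHEN b.other_column_b IS NULL THEN b_avg.avg_b
         ELSE b.other_column_b
    END              AS c3           -- value used for further calculations
FROM a
LEFT JOIN b
    ON a.material_id = b.material_id
   AND a.location_id = b.location_id
LEFT JOIN b_avg
    ON a.material_id = b_avg.material_id;
```

Note that the CASE cannot tell "no matching row in b" apart from "matching row whose other_column_b is itself NULL"; if that distinction matters, testing a non-nullable column of b (such as b.material_id) may be the safer check.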

score 0 · Accepted Answer

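The solution is to left join without the condition a.location_id = b.location_id and to number all rows in order of preference, then filter by row_number. In the code below, the join first duplicates rows, because every b row with a matching material_id is joined. The row_number() function then assigns 1 to the row where a.location_id = b.location_id. A row where a.location_id <> b.location_id gets 2 if an exact match also exists, and 1 if it does not. b.location_id is added to the ORDER BY inside row_number() so that, when there is no exact match, the row with the lower b.location_id is preferred. I hope you get the idea.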

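-- Assumes (material_id, location_id) is unique in a; column names follow the sample data.
select material_id, location_id, other_column_a, other_column_b
from
(
SELECT
   a.material_id,
   a.location_id,
   a.other_column_a,
   b.other_column_b,
   -- rank the candidate b rows per row of a: exact location match first, fallback second
   row_number() over(partition by a.material_id, a.location_id
                     order by CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END, b.location_id ) as rn
FROM a
LEFT JOIN (some more complex select) b
   ON a.material_id=b.material_id
)s
where rn=1
;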
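To try the row_number() idea against the sample data from the question, a self-contained sketch might look like this. It assumes a reasonably recent Hive version (CTEs, SELECT without FROM, and UNION ALL inside a CTE), inlines the sample rows instead of reading real tables, and partitions by both a.material_id and a.location_id so that every row of a keeps exactly one result row.

```sql
-- Sketch only: the sample rows from the question are inlined as CTEs.
WITH a AS (
    SELECT 100 AS material_id, 1 AS location_id, 45 AS other_column_a
    UNION ALL SELECT 101, 1, 45
    UNION ALL SELECT 103, 1, 45
    UNION ALL SELECT 103, 2, 45
),
b AS (
    SELECT 100 AS material_id, 1 AS location_id, 66 AS other_column_b
    UNION ALL SELECT 102, 1, 76
    UNION ALL SELECT 103, 2, 88
)
SELECT material_id, location_id, other_column_a, other_column_b
FROM (
    SELECT
        a.material_id,
        a.location_id,
        a.other_column_a,
        b.other_column_b,
        -- one partition per row of a; exact location match sorts first
        ROW_NUMBER() OVER (
            PARTITION BY a.material_id, a.location_id
            ORDER BY CASE WHEN a.location_id = b.location_id THEN 1 ELSE 2 END,
                     b.location_id
        ) AS rn
    FROM a
    LEFT JOIN b
        ON a.material_id = b.material_id
) s
WHERE rn = 1;
```

On this data the query should return the four rows of the desired result table (in no guaranteed order): (100, 1, 45, 66), (101, 1, 45, NULL), (103, 1, 45, 88) and (103, 2, 45, 88). The row (103, 1) picks up 88 because its only candidate in b is the location_id=2 row.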
