I am running a query with the following structure on Impala. It ran for more than 20 hours and did not finish:

INSERT INTO Final_table
WITH t1 AS
(
 SELECT account_id, request_id, status_1
 FROM table_1
 WHERE status_1 = "20"
),
t2 AS
(
 SELECT account_id, request_id, status_2
 FROM table_2
 WHERE status_2 = "10"
)
-- status_1 comes from t1 and status_2 from t2
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM t1
INNER JOIN t2
ON (t1.account_id = t2.account_id OR t1.request_id = t2.request_id);

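The problem is the OR condition in the ON clause: t1 on its own produces about 14M records and t2 on its own about 15M. Because I ran into memory issues, I took the t1 and t2 subqueries, ran them separately, and saved their results to new tables. I then performed the join as follows: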

CREATE TABLE sub_table_1
AS
 SELECT account_id, request_id, status_1
 FROM table_1
 WHERE status_1 = "20";

CREATE TABLE sub_table_2
AS
 SELECT account_id, request_id, status_2
 FROM table_2
 WHERE status_2 = "10";

-- status_1 comes from sub_table_1 and status_2 from sub_table_2
INSERT INTO Final_table
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM sub_table_1 AS t1
INNER JOIN sub_table_2 AS t2
ON (t1.account_id = t2.account_id OR t1.request_id = t2.request_id);

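The sub-tables were created successfully, but the final join still has the same problem. Would it make sense to perform the join in two steps, each with one condition, and then combine the two results? Or is there another approach that would help?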

1 Answer

You can use a UNION:

  1. Get the result of the join on the first condition (result1).

  2. UNION it with the result of the join on the second condition (result2).

SELECT * FROM t1 JOIN t2 ON t1.account_id = t2.account_id
UNION
SELECT * FROM t1 JOIN t2 ON t1.request_id = t2.request_id;
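Applied to the tables from the question, the idea might look like the sketch below. It assumes that status_1 comes from sub_table_1 and status_2 from sub_table_2, and that Final_table expects the four columns in this order. Each branch joins on a single equality, which should let Impala plan a hash join rather than the nested loop join it typically falls back to for an OR condition. The two options are alternatives, so run only one of them.

```sql
-- Option A: plain UNION of two equi-joins.
-- UNION removes duplicate output rows, so pairs that satisfy both
-- conditions appear once, but so do any other identical output rows.
INSERT INTO Final_table
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM sub_table_1 AS t1
INNER JOIN sub_table_2 AS t2
  ON t1.account_id = t2.account_id
UNION
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM sub_table_1 AS t1
INNER JOIN sub_table_2 AS t2
  ON t1.request_id = t2.request_id;

-- Option B: UNION ALL with the overlap filtered out of the second branch.
-- This returns one row per matching (t1, t2) pair, like the original OR join.
INSERT INTO Final_table
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM sub_table_1 AS t1
INNER JOIN sub_table_2 AS t2
  ON t1.account_id = t2.account_id
UNION ALL
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM sub_table_1 AS t1
INNER JOIN sub_table_2 AS t2
  ON t1.request_id = t2.request_id
-- Skip pairs already produced by the first branch (account_id matched).
WHERE t1.account_id <> t2.account_id
   OR t1.account_id IS NULL
   OR t2.account_id IS NULL;
```

Because the selected columns do not identify the t1 row, Option A can return fewer rows than the OR join whenever several t1 rows match the same t2 row. Option B avoids the deduplication step entirely, which may also be cheaper on inputs of this size. Running EXPLAIN on either statement before executing it shows which join strategy Impala actually picks.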

Answered 2021-01-18T15:42:34.647