I am running a query with the following structure on Impala. It ran for more than 20 hours without finishing:
INSERT INTO Final_table
WITH t1 AS
(SELECT account_id, request_id, status_1
FROM table_1
WHERE status_1 = "20"
),
t2 AS
(
SELECT account_id, request_id, status_2
FROM table_2
WHERE status_2 = "10"
)
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM t1
INNER JOIN t2
ON (t1.account_id = t2.account_id OR t1.request_id = t2.request_id);
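The problem lies precisely in the OR condition in the ON clause: t1 on its own returns about 14M rows, and t2 on its own returns about 15M rows. Because I ran into memory issues, I took the t1 and t2 subqueries, executed them separately, and saved their results to new tables. I then ran the join as follows: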
CREATE TABLE sub_table_1
AS
SELECT account_id, request_id, status_1
FROM table_1
WHERE status_1 = "20";
CREATE TABLE sub_table_2
AS
SELECT account_id, request_id, status_2
FROM table_2
WHERE status_2 = "10";
INSERT INTO Final_table
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM sub_table_1 AS t1
INNER JOIN sub_table_2 AS t2
ON (t1.account_id = t2.account_id OR t1.request_id = t2.request_id);
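The sub-tables were created successfully, but the final join still hits the same problem. Would it make sense to perform the join in two steps, one condition per step, and then combine the two results? Or is there another approach that would help?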
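
For concreteness, here is a minimal sketch of the two-step idea. It assumes the sub_table_1 and sub_table_2 tables described above and has not been run against the real data, so it only shows the shape of the rewrite. Each branch joins on a single equality condition, and the second branch skips pairs that already matched on account_id so that UNION ALL does not emit them twice. The IS DISTINCT FROM filter is an assumption about how to keep the result the same as the OR join when account_id can be NULL.

```sql
INSERT INTO Final_table
-- Branch 1: row pairs that match on account_id.
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM sub_table_1 AS t1
INNER JOIN sub_table_2 AS t2
  ON t1.account_id = t2.account_id
UNION ALL
-- Branch 2: row pairs that match on request_id only.
-- Pairs that also match on account_id were already returned by branch 1,
-- so they are filtered out here to avoid duplicating them.
SELECT t2.account_id, t2.request_id, t1.status_1, t2.status_2
FROM sub_table_1 AS t1
INNER JOIN sub_table_2 AS t2
  ON t1.request_id = t2.request_id
WHERE t1.account_id IS DISTINCT FROM t2.account_id;
```

A plain UNION in place of UNION ALL plus the filter would also drop the overlap, but it would collapse any legitimately repeated output rows as well, so its result could differ from the original OR join.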