如果两个表的每个访问者的记录数都很高,那么这个连接是有问题的,原因是 Marcin 在评论中描述的。因此,在这种情况下,如果可能的话,最好完全避免这种连接。
我最终解决这个问题的方法是废弃visitor_customer_hist 表并编写自定义窗口函数/udtf。
最初我创建lkp.visitor_customer_hist
表是因为它可以使用现有的窗口函数创建,并且可以在非 MPP sql 数据库上创建适当的索引,这将使查找具有足够的性能。它是这样创建的:
CREATE TABLE lkp.visitor_customer_hist AS
SELECT
a.visitor_id AS visitor_id,
a.customer_id AS customer_id,
nvl(lag(a.session_datetime) OVER ( PARTITION BY a.visitor_id
ORDER BY a.session_datetime ), '1900-01-01') AS from_datetime,
CASE WHEN lead(a.session_datetime) OVER ( PARTITION BY a.visitor_id
ORDER BY a.session_datetime ) IS NULL THEN '9999-12-31'
ELSE a.session_datetime END AS to_datetime
FROM (
SELECT
s.session_id,
vs.visitor_id,
customer_id,
row_number() OVER ( PARTITION BY vs.visitor_id, s.session_datetime
ORDER BY s.session_id ) AS rn,
lead(s.customer_id) OVER ( PARTITION BY vs.visitor_id
ORDER BY s.session_datetime ) AS next_cust_id,
session_datetime
FROM "session" s
JOIN "visitor_session" vs ON vs.session_id = s.session_id
WHERE s.customer_id <> -2
) a
WHERE (a.next_cust_id <> a.customer_id
OR a.next_cust_id IS NULL) AND a.rn = 1;
因此,我放弃了这种方法,而是编写了以下 UDTF:
CREATE OR REPLACE FUNCTION udtf_eff_customer(customer_id FLOAT)
RETURNS TABLE(effective_customer_id FLOAT)
LANGUAGE JAVASCRIPT
IMMUTABLE
AS '
{
initialize: function() {
this.customer_id = -1;
},
processRow: function (row, rowWriter, context) {
if (row.CUSTOMER_ID != -1) {
this.customer_id = row.CUSTOMER_ID;
}
rowWriter.writeRow({EFFECTIVE_CUSTOMER_ID: this.customer_id});
},
finalize: function (rowWriter, context) {/*...*/},
}
';
它可以像这样应用:
SELECT
iff(a.customer_id <> -1, a.customer_id, ec.effective_customer_id) AS customer_id,
a.session_id
FROM "session" a
JOIN table(udtf_eff_customer(nvl2(a.visitor_id, a.customer_id, NULL) :: DOUBLE) OVER ( PARTITION BY a.visitor_id
ORDER BY a.session_datetime DESC )) ec
所以这实现了预期的结果:对于每个会话,如果 customer_id 不是“未知”,那么我们继续使用它;否则,我们使用可以与该访问者关联的下一个 customer_id(如果存在)(按会话时间排序)。
这是一个比创建查找表更好的解决方案;它本质上只需要一次通过数据,需要更少的代码/复杂性,并且速度非常快。