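I have the raw data below. Each row is a user's transaction record, together with the month in which the transaction took place.
I want to count the users who placed an order in a given month and how many of them also ordered in the previous month (retention), so that I can tell what percentage of users are repeat users.
How can I do this in BigQuery?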
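One way is to self-join the table with a one-month lag. That way we match each user & month combination against user & previous-month to see whether it is a returning user. For example, using the public table bigquery-public-data.hacker_news.stories, which has about 2M rows,
and a specific user:
Note that prev_month is null for 2014-02-01 (we use LEFT OUTER JOIN) because the user was not active during 2014-01-01. We join on author and on the month lagged by one: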
FROM authors AS a
LEFT OUTER JOIN authors AS b
ON a.author = b.author
AND a.month = DATE_ADD(b.month, INTERVAL 1 MONTH)
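To inspect the join for one user, a query along the following lines could be used. This is only a sketch: 'some_author' is a hypothetical placeholder, not a value from the original answer, and prev_month is simply the alias given to the joined month.

```sql
-- Sketch: show each active month of one author next to the previous month,
-- if the author was active in it. 'some_author' is a hypothetical placeholder.
WITH
  authors AS (
  SELECT
    author,
    DATE_TRUNC(DATE(time_ts), MONTH) AS month
  FROM
    `bigquery-public-data.hacker_news.stories`
  WHERE
    author = 'some_author'
  GROUP BY 1,2)
SELECT
  a.author,
  a.month,
  b.month AS prev_month  -- NULL when the author was not active the month before
FROM
  authors AS a
LEFT OUTER JOIN
  authors AS b
ON
  a.author = b.author
  AND a.month = DATE_ADD(b.month, INTERVAL 1 MONTH)
ORDER BY a.month
```

Rows where prev_month is NULL are months in which the author counts as a new or reactivated user rather than a returning one.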
Then we count a user as returning if the previous month is not null:
COUNT(a.author) AS num_users,
COUNTIF(b.month IS NOT NULL) AS num_returning_users
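Note that we do not use DISTINCT here, because we already grouped by the author and month combination when defining the authors CTE.
For other examples you might need to take this into account.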
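As an illustration of when DISTINCT matters, here is a minimal sketch in which the CTE is not grouped by user and month. The events data, the user_id and event_date columns, and the user_months name are all made up for this example. Because duplicate rows survive into the join, the counts have to be taken over distinct users.

```sql
-- Hypothetical inline data: alice has two rows in January 2014.
WITH
  events AS (
  SELECT * FROM UNNEST([
    STRUCT('alice' AS user_id, DATE '2014-01-05' AS event_date),
    ('alice', DATE '2014-01-20'),
    ('alice', DATE '2014-02-03'),
    ('bob', DATE '2014-02-10')])),
  user_months AS (
  -- No GROUP BY here, so duplicate user/month rows remain.
  SELECT
    user_id,
    DATE_TRUNC(event_date, MONTH) AS month
  FROM
    events)
SELECT
  a.month,
  COUNT(DISTINCT a.user_id) AS num_users,
  -- IF() yields NULL for non-returning rows, which COUNT(DISTINCT ...) ignores.
  COUNT(DISTINCT IF(b.month IS NOT NULL, a.user_id, NULL)) AS num_returning_users
FROM
  user_months AS a
LEFT OUTER JOIN
  user_months AS b
ON
  a.user_id = b.user_id
  AND a.month = DATE_ADD(b.month, INTERVAL 1 MONTH)
GROUP BY 1
ORDER BY 1
```

With this sample data, February 2014 has two users and one returning user (alice). A plain COUNT would report three users and two returning users instead, since alice's February row matches both of her January rows.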
Full query:
WITH
authors AS (
SELECT
author,
DATE_TRUNC(DATE(time_ts), MONTH) AS month
FROM
`bigquery-public-data.hacker_news.stories`
WHERE
author IS NOT NULL
GROUP BY 1,2)
SELECT
*,
ROUND(100*SAFE_DIVIDE(num_returning_users,
num_users),2) AS retention
FROM (
SELECT
a.month,
COUNT(a.author) AS num_users,
COUNTIF(b.month IS NOT NULL) AS num_returning_users
FROM
authors AS a
LEFT OUTER JOIN
authors AS b
ON
a.author = b.author
AND a.month = DATE_ADD(b.month, INTERVAL 1 MONTH)
GROUP BY 1
ORDER BY 1
LIMIT 100)
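And a snippet of the results:
These results are correct; for example, for 2007-03-01:
Performance is nothing fancy, but in this case we select only the fields needed for the aggregation, so little data is scanned and the execution time is reasonable (about 5 s).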
Another approach is to use EXISTS() inside COUNTIF() instead of the join, but it takes longer for me (~7 s).
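The query for that variant is not reproduced here, so the following is a reconstruction under assumptions rather than the query that was actually timed. It reuses the authors CTE from the full query and replaces the join with a correlated EXISTS() subquery.

```sql
-- Sketch of the EXISTS() variant; a reconstruction, not the original query.
WITH
  authors AS (
  SELECT
    author,
    DATE_TRUNC(DATE(time_ts), MONTH) AS month
  FROM
    `bigquery-public-data.hacker_news.stories`
  WHERE
    author IS NOT NULL
  GROUP BY 1,2)
SELECT
  a.month,
  COUNT(a.author) AS num_users,
  -- TRUE when the same author also has a row for the previous month.
  COUNTIF(EXISTS(
    SELECT 1
    FROM authors AS b
    WHERE b.author = a.author
      AND a.month = DATE_ADD(b.month, INTERVAL 1 MONTH))) AS num_returning_users
FROM
  authors AS a
GROUP BY 1
ORDER BY 1
```

If BigQuery refuses the correlated subquery inside the aggregate, one possible workaround is to compute the EXISTS() flag per row in an inner SELECT and apply COUNTIF() to that flag in the outer query.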
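If you are only looking at the previous month, aggregate the data by user and month and turn each month into a sequential integer. Then you can use lag():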
select month,
count(*) as num_users,
countif(prev_month_int = month_int - 1) as prev_num_users,
countif(prev_month_int = month_int - 1) / count(*) as repeat_rate
from (select mu.*,
lag(month_int) over (partition by userid order by month_int) as prev_month_int
from (select month, userid, count(*) as num_orders,
cast(split(month, '-')[ordinal(1)] as int64) * 12 + cast(split(month, '-')[ordinal(2)] as int64) as month_int
from t
group by month, userid
) mu
) mu
group by month;
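The month_int expression is what makes the comparison with the previous row work across a year boundary. A minimal sketch of just that conversion, assuming month is a string in 'YYYY-MM' form (which the split() call above suggests, though the original data is not shown):

```sql
-- Sketch: convert 'YYYY-MM' strings to a running month number (year * 12 + month).
-- The two sample values are made up to show a year boundary.
SELECT
  month,
  CAST(SPLIT(month, '-')[ORDINAL(1)] AS INT64) * 12
    + CAST(SPLIT(month, '-')[ORDINAL(2)] AS INT64) AS month_int
FROM
  UNNEST(['2014-12', '2015-01']) AS month
ORDER BY month
```

Here '2014-12' maps to 24180 and '2015-01' to 24181, so consecutive months always differ by exactly 1 and the test prev_month_int = month_int - 1 also holds between December and January.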