
I have a table of users and the number of events they fired on a given date:

DATE USERID EVENTS
2021-08-27 1 5
2021-07-25 1 7
2021-07-23 2 3
2021-07-20 3 9
2021-06-22 1 9
2021-05-05 1 4
2021-05-05 2 2
2021-05-05 3 6
2021-05-05 4 8
2021-05-05 5 1

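I want to create a table that shows the number of active users for each date, where an active user is defined as someone who fired an event on the given date or on any day in the preceding 30 days.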

DATE ACTIVE_USERS
2021-08-27 1
2021-07-25 3
2021-07-23 2
2021-07-20 2
2021-06-22 1
2021-05-05 5

I tried the following query, which only returns the users who were active on the specified date itself:

SELECT COUNT(DISTINCT USERID), DATE
FROM table
WHERE DATE >= (CURRENT_DATE() - interval '30 days')
GROUP BY 2 ORDER BY 2 DESC;

I also tried a window function with ROWS BETWEEN, but I seem to end up with the same result:

SELECT
    DATE,
    SUM(ACTIVE_USERS) AS ACTIVE_USERS
FROM
(
SELECT
    DATE,
    CASE
        WHEN SUM(EVENTS) OVER (PARTITION BY USERID ORDER BY DATE ROWS BETWEEN 30 PRECEDING AND CURRENT ROW) >= 1 THEN 1
        ELSE 0
    END AS ACTIVE_USERS
FROM table
)
GROUP BY 1
ORDER BY 1

I am using SQL:ANSI on Snowflake. Any suggestions would be much appreciated.


1 Answer


This is tricky to do as a window function, because count(distinct) is not allowed there. You can use a self-join:

select t1.date, count(distinct t2.userid) as active_users
from table t1 join
     table t2
     on t2.date <= t1.date and
        t2.date > t1.date - interval '30 day'
group by t1.date;

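However, this can be expensive. One alternative is to "unpivot" the data. That is, record an incremental count each time a user "enters" and "exits" the active state, and then take a cumulative sum: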

with d as (  -- one +1 row when a user becomes active, one -1 row 30 days later
      select userid, date, +1 as inc
      from table
      union all
      select userid, date + interval '30 day', -1 as inc
      from table
     ),
     d2 as (  -- accumulate to get each user's net active count per day
      select date, userid, sum(inc) as change_on_day,
             sum(sum(inc)) over (partition by userid order by date) as running_inc
      from d
      group by date, userid
     ),
     d3 as (  -- summarize into active periods [start_date, end_date)
      select userid, min(date) as start_date, max(date) as end_date
      from (select d2.*,
                   -- count only the earlier rows where the running count fell to 0,
                   -- so the closing row stays in the period it ends
                   sum(case when running_inc = 0 then 1 else 0 end) over (partition by userid order by date)
                     - (case when running_inc = 0 then 1 else 0 end) as active_period
            from d2
           ) d2
      group by userid, active_period
     )
select d.date, count(d3.userid) as active_users
from (select distinct date from table) d left join
     d3
     on d.date >= d3.start_date and d.date < d3.end_date
group by d.date;
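
To try this against the sample rows from the question, a minimal setup might look like the sketch below. The table name `user_events` is an assumption made for illustration (`table` in the queries above is only a placeholder), and the column types are guessed from the sample data.

```sql
-- Hypothetical table name and column types; adjust to the real schema.
create or replace temporary table user_events (date date, userid integer, events integer);

-- Sample rows copied from the question.
insert into user_events (date, userid, events) values
    ('2021-08-27', 1, 5),
    ('2021-07-25', 1, 7),
    ('2021-07-23', 2, 3),
    ('2021-07-20', 3, 9),
    ('2021-06-22', 1, 9),
    ('2021-05-05', 1, 4),
    ('2021-05-05', 2, 2),
    ('2021-05-05', 3, 6),
    ('2021-05-05', 4, 8),
    ('2021-05-05', 5, 1);

-- Self-join version: distinct users with an event in the 30-day window ending on each date.
select t1.date, count(distinct t2.userid) as active_users
from user_events t1 join
     user_events t2
     on t2.date <= t1.date and
        t2.date > t1.date - interval '30 day'
group by t1.date
order by t1.date desc;
```

On these rows the result should match the ACTIVE_USERS table in the question, for example 3 on 2021-07-25 and 2 on 2021-07-20. Note that the window is open at the far end: an event exactly 30 days earlier does not count, which is why user 1's event on 2021-06-22 is included for 2021-07-20 but not for 2021-07-23.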
Answered 2021-08-30T10:46:38.347