sql - Trying to count cumulative distinct entities using Redshift SQL

Question

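I'm trying to get a cumulative count of distinct objects over a time series in Redshift. The most straightforward approach would be COUNT(DISTINCT myfield) OVER (ORDER BY timefield DESC ROWS UNBOUNDED PRECEDING), but Redshift returns a "Window definition is not supported" error.

For example, the code below tries to find the cumulative distinct users for every week from the first week until now. However, I get a "Window function not supported" error.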

SELECT user_time.weeks_ago, 
       COUNT(distinct user_time.user_id) OVER
            (ORDER BY weeks_ago desc ROWS UNBOUNDED PRECEDING) as count
FROM   (SELECT FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7) AS weeks_ago,
               ev.user_id as user_id
        FROM events as ev
        WHERE ev.action='some_user_action') as user_time

The goal is to build a cumulative time series of the unique users who have performed a given action. Any ideas on how to do this?

score 4 · Accepted Answer

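Here is how to apply it to the example cited here. I also added one more row that duplicates 'table' for '2015-01-01', to show how this counts distinct values.

The author of that example got the solution wrong, but I am only borrowing his example data.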

create table public.test
(
  "date" date,
  item varchar(8),
  measure int
);

insert into public.test
    values
      ('2015-01-01', 'table',   12),
      ('2015-01-01', 'table',   120),
      ('2015-01-01', 'chair',   51),
      ('2015-01-01', 'lamp',    8),
      ('2015-01-02', 'table',   17),
      ('2015-01-02', 'chair',   72),
      ('2015-01-02', 'lamp',    23),
      ('2015-01-02', 'bed',     1),
      ('2015-01-02', 'dresser', 2),
      ('2015-01-03', 'bed',     1);

WITH x AS (
    SELECT
      *,
      DENSE_RANK()
      OVER (PARTITION BY "date"
        ORDER BY item) AS dense_rank
    FROM public.test
)
SELECT
  "date",
  item,
  measure,
  max(dense_rank)
  OVER (PARTITION BY "date")
FROM x
ORDER BY 1;

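The CTE gives you the dense rank of each item within each date, and the main query then takes the maximum of that dense rank per date, which is the distinct count of items for that date.

You need DENSE_RANK rather than plain RANK to count distinct values: RANK skips numbers after ties, so its maximum can exceed the number of distinct values.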
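
As a quick sanity check of that claim, the sketch below loads the same sample rows into an in-memory SQLite database through Python's `sqlite3` module and compares the MAX(DENSE_RANK()) result with a plain GROUP BY COUNT(DISTINCT). SQLite is only a local stand-in for Redshift here (an assumption; it needs SQLite 3.25 or later for window functions), and the names `item_rank` and `distinct_items` are made up for the illustration.

```python
import sqlite3

# Assumption: SQLite replaces Redshift so the sketch can run locally.
conn = sqlite3.connect(":memory:")
conn.execute('CREATE TABLE test ("date" TEXT, item TEXT, measure INTEGER)')
conn.executemany(
    "INSERT INTO test VALUES (?, ?, ?)",
    [
        ("2015-01-01", "table", 12),
        ("2015-01-01", "table", 120),
        ("2015-01-01", "chair", 51),
        ("2015-01-01", "lamp", 8),
        ("2015-01-02", "table", 17),
        ("2015-01-02", "chair", 72),
        ("2015-01-02", "lamp", 23),
        ("2015-01-02", "bed", 1),
        ("2015-01-02", "dresser", 2),
        ("2015-01-03", "bed", 1),
    ],
)

# Distinct items per date as the maximum dense rank within each date.
windowed = conn.execute(
    """
    WITH x AS (
        SELECT "date",
               DENSE_RANK() OVER (PARTITION BY "date" ORDER BY item) AS item_rank
        FROM test
    )
    SELECT DISTINCT "date",
           MAX(item_rank) OVER (PARTITION BY "date") AS distinct_items
    FROM x
    ORDER BY "date"
    """
).fetchall()

# Reference result from an ordinary grouped COUNT(DISTINCT).
grouped = conn.execute(
    'SELECT "date", COUNT(DISTINCT item) FROM test GROUP BY "date" ORDER BY "date"'
).fetchall()

print(windowed)
print(grouped)
```

Both queries should agree for this data. Note that this gives the distinct count within each date, not a running count across dates.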

score 3 · Accepted Answer

Figured out the answer. The trick turned out to be a set of nested subqueries: the inner one calculates the week of each user's first action, the middle one counts how many users first engaged in each week, and the outer query computes the cumulative sum over the time series:

(SELECT engaged_per_week.week as week,
       SUM(engaged_per_week.total) over (order by engaged_per_week.week DESC ROWS UNBOUNDED PRECEDING) as total
 FROM 
    -- COUNT OF FIRST TIME ENGAGEMENTS PER WEEK
    (SELECT engaged.first_week AS week,
            count(engaged.first_week) AS total
    FROM
       -- WEEK OF FIRST ENGAGEMENT FOR EACH USER
       (SELECT  MAX(FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7)) as first_week
        FROM     events ev
        WHERE    ev.name='some_user_action'
        GROUP BY ev.user_id) AS engaged

    GROUP BY week) as engaged_per_week
ORDER BY week DESC) as cumulative_engaged

score 1 · Accepted Answer

It seems to work when you wrap the count distinct in a sum, like this:

 SELECT user_time.weeks_ago, 
   SUM(COUNT(distinct user_time.user_id)) OVER
        (ORDER BY weeks_ago desc ROWS UNBOUNDED PRECEDING) as test
        FROM   (SELECT FLOOR(EXTRACT(DAY FROM sysdate - ev.time) / 7) AS weeks_ago
            ,ev.user_id as user_id
    FROM events as ev
    WHERE ev.action='some_user_action'
    ) user_time
GROUP BY user_time.weeks_ago
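
One thing worth checking before relying on this query: the windowed SUM adds up each week's distinct count, so a user who is active in several weeks contributes once per active week. The sketch below illustrates this on a hypothetical three-week `events` table, again using Python's `sqlite3` module as a local stand-in for Redshift (an assumption). The per-week count is moved into a subquery, which should be the same computation as SUM(COUNT(DISTINCT ...)) OVER with GROUP BY; the aliases `weekly_users`, `running_total`, and `distinct_total` are invented for the example.

```python
import sqlite3

# Assumption: SQLite replaces Redshift; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (weeks_ago INTEGER, user_id TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [(2, "a"), (2, "b"), (1, "a"), (1, "c"), (0, "a"), (0, "b")],
)

# Running sum of the per-week distinct counts (oldest week first).
running_sum = conn.execute(
    """
    SELECT weeks_ago,
           SUM(weekly_users) OVER (ORDER BY weeks_ago DESC
                                   ROWS UNBOUNDED PRECEDING) AS running_total
    FROM (SELECT weeks_ago, COUNT(DISTINCT user_id) AS weekly_users
          FROM events
          GROUP BY weeks_ago) AS per_week
    ORDER BY weeks_ago DESC
    """
).fetchall()

# Cumulative distinct users: everyone seen in this week or any earlier one.
cumulative_distinct = conn.execute(
    """
    SELECT w.weeks_ago,
           (SELECT COUNT(DISTINCT e.user_id)
            FROM events AS e
            WHERE e.weeks_ago >= w.weeks_ago) AS distinct_total
    FROM (SELECT DISTINCT weeks_ago FROM events) AS w
    ORDER BY w.weeks_ago DESC
    """
).fetchall()

print(running_sum)          # [(2, 2), (1, 4), (0, 6)]
print(cumulative_distinct)  # [(2, 2), (1, 3), (0, 3)]
```

With only three users in the table, the running sum reaches 6 because users a and b are counted again in later weeks. The two results match only when no user appears in more than one week.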

score 1 · Accepted Answer

You should use DENSE_RANK instead of COUNT(DISTINCT):

DENSE_RANK() OVER(PARTITION BY weeks_ago ORDER BY user_time.user_id)

score 0 · Accepted Answer

I ran into the same problem and solved it by applying MAX() OVER (PARTITION BY) on top of DENSE_RANK(), as in the code below. Hopefully it helps anyone still struggling with this:

-- IN NZ

select 
    id,NAME,count(distinct name) OVER (
        PARTITION BY id)
        from
edw.admin.test;

/*
create table edw.admin.test 
as 
(       
select 1 as id,'Anne' as name,500.0 as amt,'iv' as IID
    union ALL
select 1,'Jeni',550.0,'is'
    union ALL
select 1,'Arna',250.0,'is'
    union ALL
select 2,'Raj',290.0,'is'
    union ALL
select 1,'Anne',350.0,'ir'
    union ALL
select 1,NULL,350.0,'ir'
    union ALL
select 3,NULL,350.0,'ir'
    union ALL
select 3,NULL,350.0,'ir');

Output in NZ:
-------------------------
ID  NAME    COUNT
1   NULL    3
1   Anne    3
1   Anne    3
1   Arna    3
1   Jeni    3
2   Raj     1
3   NULL    0
3   NULL    0
*/


-- IN AWS RS



select id, name, max(DENSE_COUNT) over(partition by id)
from(
select 
    id,name,CASE WHEN name IS NULL THEN 0 ELSE DENSE_RANK() OVER (
        PARTITION BY id
        order by name) END AS DENSE_COUNT
        from
(       
select 1 as id,'Anne' as name,500.0 as amt,'iv' as IID
    union ALL
select 1,'Jeni',550.0,'is'
    union ALL
select 1,'Arna',250.0,'is'
    union ALL
select 2,'Raj',290.0,'is'
    union ALL
select 1,'Anne',350.0,'ir'
    union ALL
select 1,NULL,350.0,'ir'
    union ALL
select 3,NULL,350.0,'ir'
    union ALL
select 3,NULL,350.0,'ir'));

/*
Output in RS:
-------------------------
id  name    max
1   Anne    3
1   Anne    3
1   Arna    3
1   Jeni    3
1   NULL    3
2   Raj     1
3   NULL    0
3   NULL    0
*/

score 0 · Accepted Answer

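None of the solutions above worked for me. Here is the one that did. The way to think about the problem is as follows:

If someone performed the action in the first week, count them in that week.
In every subsequent week, count only the additional users, i.e. those who did not appear in any earlier week.

So we just need to find the first date on which each user appears, take a running count over those rows ordered by date, and then group by date and take the maximum (cumulative) value.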

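-- first date on which each user appears
with first_date as (SELECT ev.user_id,
                           min(ev.date) as first_entry_date
                    FROM events as ev
                    WHERE certain_condition
                    GROUP by 1
               ),
-- running count of users, ordered by their first date
     ranked as (SELECT count(*) OVER (ORDER BY first_entry_date rows unbounded preceding) as counts,
                        first_entry_date 
                 FROM first_date
              )
-- the largest running count on each date is the cumulative total
SELECT first_entry_date as day, 
       max(counts) as users_cum_sum 
FROM ranked 
GROUP BY 1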
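
To see the first-appearance idea on concrete numbers, here is a minimal runnable sketch using Python's `sqlite3` module with an in-memory database. SQLite is a stand-in for Redshift (an assumption), and the `events` rows and the column name `event_date` are hypothetical.

```python
import sqlite3

# Assumption: SQLite replaces Redshift; the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (user_id TEXT, event_date TEXT)")
conn.executemany(
    "INSERT INTO events VALUES (?, ?)",
    [
        ("a", "2020-01-01"),
        ("b", "2020-01-01"),
        ("a", "2020-01-02"),  # repeat user, must not be counted again
        ("c", "2020-01-02"),
        ("b", "2020-01-03"),  # repeat user only, so no new row for this date
        ("d", "2020-01-04"),
    ],
)

cumulative_users = conn.execute(
    """
    WITH first_date AS (
        -- one row per user: the first date on which they appear
        SELECT user_id, MIN(event_date) AS first_entry_date
        FROM events
        GROUP BY user_id
    ),
    ranked AS (
        -- running count of users ordered by first appearance
        SELECT first_entry_date,
               COUNT(*) OVER (ORDER BY first_entry_date
                              ROWS UNBOUNDED PRECEDING) AS counts
        FROM first_date
    )
    SELECT first_entry_date AS day, MAX(counts) AS users_cum_sum
    FROM ranked
    GROUP BY first_entry_date
    ORDER BY first_entry_date
    """
).fetchall()

print(cumulative_users)
# [('2020-01-01', 2), ('2020-01-02', 3), ('2020-01-04', 4)]
```

Dates on which no new user appears (2020-01-03 here) do not show up in the result, so a gap-free series would need a join against a calendar of dates.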
