
I'm trying to determine the maximum change in some time-series data over several periods. Here is a sample dataset:

drop table if exists query_table ;
create temp table query_table (groupcol TEXT, parcol TEXT, daycol Integer, val Integer);

insert into query_table values 
    ('g1', 'p1', 1, 1),
    ('g1', 'p1', 2, 2),
    ('g1', 'p1', 3, 3),
    ('g1', 'p1', 4, 4),
    ('g1', 'p2', 1, 2),
    ('g1', 'p2', 2, 4),
    ('g1', 'p2', 3, 6),
    ('g1', 'p2', 4, 8),
    ('g2', 'p1', 1, 10),
    ('g2', 'p1', 2, 20),
    ('g2', 'p1', 3, 30),
    ('g2', 'p1', 4, 40),
    ('g2', 'p2', 1, 20),
    ('g2', 'p2', 2, 40),
    ('g2', 'p2', 3, 60),
    ('g2', 'p2', 4, 80);

The basic query I'm running looks like this (here with a lag of 1 day):

with
  change_over_time as (
    select groupcol, parcol, daycol,
      (val - lag(val, 1) over (partition by groupcol, parcol order by daycol) ) as change
      from query_table
  ),
  max_change as (
    select groupcol, max(abs(change)) as maxchange
    from change_over_time
    group by groupcol
  )
select * from max_change;

This results in:

 groupcol | maxchange
----------+-----------
 g1       |         2
 g2       |        20

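Right now I issue this query and loop over the desired lag offsets in Python, but the queries take a while, and I'd like to do this in pure SQL. The query will run in Snowflake, and I can use Snowflake-specific extensions.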

The only solution I can think of is to use Python to generate a query like this:

with
  change_over_time as (
      
        select groupcol, parcol, daycol, 1 as days,
          (val - lag(val, 1) over (partition by groupcol, parcol order by daycol) ) as change
          from query_table
    
    union all
  
        select groupcol, parcol, daycol, 2 as days,
          (val - lag(val, 2) over (partition by groupcol, parcol order by daycol) ) as change
          from query_table
   
    ),
   max_change as (
        select groupcol, days, max(abs(change)) as maxchange
        from change_over_time
        group by groupcol, days
  )
select * from max_change;

so that I get a result like this:

 groupcol | days | maxchange
----------+------+-----------
 g1       |    1 |         2
 g2       |    1 |        20
 g1       |    2 |         4
 g2       |    2 |        40

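Ideally, though, I'd like to run many different lags (hundreds of them, possibly every value from 1 to 730 days) using SQL alone, and to be able to specify the lags in a clean way.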


2 Answers


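I'm not entirely sure I understand what you're trying to do.

That said, I think you can get the answer without using lag at all.

Check whether the following meets your requirements.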

WITH
    day_table(days) AS (
        SELECT *
        FROM (VALUES (1), (2)) AS x
    )
SELECT
    qt1.groupcol,
    qt2.daycol - qt1.daycol     AS days,
    MAX(ABS(qt2.val - qt1.val)) AS maxchange
FROM
    query_table qt1
        JOIN query_table qt2
             ON qt1.groupcol = qt2.groupcol
                 AND qt1.parcol = qt2.parcol
                 AND qt2.daycol > qt1.daycol
        JOIN day_table dt
             ON qt2.daycol - qt1.daycol = dt.days
GROUP BY
    qt1.groupcol,
    qt2.daycol - qt1.daycol
ORDER BY
    groupcol,
    days

Updated to add the absolute value and to allow restricting the result to a specific set of day offsets.

answered 2021-08-16T15:48:16.087

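You will need to create a table of day offsets that the change_over_time query can be based on. For a variable number of days (say, as many days as there are in the table), this can be done with a recursive CTE (https://docs.snowflake.com/en/user-guide/queries-cte.html#recursive-ctes-and-hierarchical-data). For a fixed number of days, a values clause is enough (https://docs.snowflake.com/en/sql-reference/constructs/values.html).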

Here is the query with the values clause added:

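with
  day_table(days) as (
    select * from (values (1), (2), (3), (4))
  ),
  change_over_time as (
    -- the cross join makes one copy of every row per lag value;
    -- d.days is part of the partition so the copies do not mix
    -- (assumes the engine accepts a per-row, non-constant lag offset; check this first)
    select t.groupcol, t.parcol, t.daycol, d.days,
      (t.val - lag(t.val, d.days) over (partition by t.groupcol, t.parcol, d.days order by t.daycol) ) as change
      from query_table t
      cross join day_table d
  ),
  max_change as (
    select groupcol, days, max(abs(change)) as maxchange
    from change_over_time
    group by groupcol, days
  )
select * from max_change;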
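
The recursive-CTE option mentioned above for a variable number of days could look roughly like the sketch below. It runs through Python's sqlite3 module only so that it is self-contained and checkable; the function name max_change_by_lag and the :max_lag parameter are made up for illustration, and the SQL has not been verified on Snowflake. It pairs rows by day difference with a self-join instead of lag, which matches a row-offset lag only when every (groupcol, parcol) series has exactly one row per consecutive day.

```python
import sqlite3

# Sample rows from the question: (groupcol, parcol, daycol, val).
ROWS = [
    ("g1", "p1", 1, 1), ("g1", "p1", 2, 2), ("g1", "p1", 3, 3), ("g1", "p1", 4, 4),
    ("g1", "p2", 1, 2), ("g1", "p2", 2, 4), ("g1", "p2", 3, 6), ("g1", "p2", 4, 8),
    ("g2", "p1", 1, 10), ("g2", "p1", 2, 20), ("g2", "p1", 3, 30), ("g2", "p1", 4, 40),
    ("g2", "p2", 1, 20), ("g2", "p2", 2, 40), ("g2", "p2", 3, 60), ("g2", "p2", 4, 80),
]

QUERY = """
WITH RECURSIVE
  -- Generates the lag offsets 1, 2, ..., :max_lag (hypothetical parameter).
  day_table(days) AS (
    SELECT 1
    UNION ALL
    SELECT days + 1 FROM day_table WHERE days < :max_lag
  ),
  -- Pairs each row with the row of the same series exactly d.days earlier.
  change_over_time AS (
    SELECT cur.groupcol, d.days, cur.val - prev.val AS change
    FROM query_table AS cur
    CROSS JOIN day_table AS d
    JOIN query_table AS prev
      ON prev.groupcol = cur.groupcol
     AND prev.parcol = cur.parcol
     AND prev.daycol = cur.daycol - d.days
  )
SELECT groupcol, days, MAX(ABS(change)) AS maxchange
FROM change_over_time
GROUP BY groupcol, days
ORDER BY days, groupcol
"""


def max_change_by_lag(conn, max_lag):
    """Return (groupcol, days, maxchange) rows for lags 1..max_lag."""
    return conn.execute(QUERY, {"max_lag": max_lag}).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE query_table (groupcol TEXT, parcol TEXT, daycol INTEGER, val INTEGER)"
)
conn.executemany("INSERT INTO query_table VALUES (?, ?, ?, ?)", ROWS)

result = max_change_by_lag(conn, 3)
for row in result:
    print(row)
```

Offsets with no matching pair of rows (for example a lag of 4 on this four-day sample) simply produce no output row, so asking for 1 to 730 is harmless on short series.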
answered 2021-08-16T05:17:45.803