I am trying to determine the maximum change in some time series data over several periods. Here is a sample dataset:
drop table if exists query_table ;
create temp table query_table (groupcol TEXT, parcol TEXT, daycol Integer, val Integer);
insert into query_table values
('g1', 'p1', 1, 1),
('g1', 'p1', 2, 2),
('g1', 'p1', 3, 3),
('g1', 'p1', 4, 4),
('g1', 'p2', 1, 2),
('g1', 'p2', 2, 4),
('g1', 'p2', 3, 6),
('g1', 'p2', 4, 8),
('g2', 'p1', 1, 10),
('g2', 'p1', 2, 20),
('g2', 'p1', 3, 30),
('g2', 'p1', 4, 40),
('g2', 'p2', 1, 20),
('g2', 'p2', 2, 40),
('g2', 'p2', 3, 60),
('g2', 'p2', 4, 80);
The basic query I am running looks like this (this one uses a lag of 1 day):
with
change_over_time as (
    select groupcol, parcol, daycol,
           (val - lag(val, 1) over (partition by groupcol, parcol order by daycol)) as change
    from query_table
),
max_change as (
    select groupcol, max(abs(change)) as maxchange
    from change_over_time
    group by groupcol
)
select * from max_change;
This results in:
groupcol | maxchange
----------+-----------
g1 | 2
g2 | 20
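What I am doing now is issuing this query once per lag offset from a loop in Python, but these queries take a while, and I would like to do this in pure SQL. The query will run in Snowflake, and I can use Snowflake-specific extensions.
The only solution I can think of is to use Python to generate a query like this: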
with
change_over_time as (
    select groupcol, parcol, daycol, 1 as days,
           (val - lag(val, 1) over (partition by groupcol, parcol order by daycol)) as change
    from query_table
    union all
    select groupcol, parcol, daycol, 2 as days,
           (val - lag(val, 2) over (partition by groupcol, parcol order by daycol)) as change
    from query_table
),
max_change as (
    select groupcol, days, max(abs(change)) as maxchange
    from change_over_time
    group by groupcol, days
)
select * from max_change;
so that I get a result like this:
groupcol | days | maxchange
----------+------+-----------
g1 | 1 | 2
g2 | 1 | 20
g1 | 2 | 4
g2 | 2 | 40
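For reference, the Python-side generation described above could be sketched roughly as follows. This is only an illustration: the function name `build_lag_query` and its `table` parameter are made up for this example, and it simply emits one `union all` branch per lag, so the statement grows linearly with the number of lags.

```python
def build_lag_query(lags, table="query_table"):
    """Build one SQL statement with a UNION ALL branch per lag offset.

    Hypothetical helper for illustration; column names follow the sample table.
    """
    lags = [int(days) for days in lags]  # only integer literals get interpolated
    if not lags:
        raise ValueError("at least one lag is required")
    if any(days < 1 for days in lags):
        raise ValueError("lags must be positive integers")

    branches = [
        f"select groupcol, parcol, daycol, {days} as days,\n"
        f"       (val - lag(val, {days}) over "
        f"(partition by groupcol, parcol order by daycol)) as change\n"
        f"from {table}"
        for days in lags
    ]
    change_over_time = "\nunion all\n".join(branches)

    return (
        "with\n"
        f"change_over_time as (\n{change_over_time}\n),\n"
        "max_change as (\n"
        "select groupcol, days, max(abs(change)) as maxchange\n"
        "from change_over_time\n"
        "group by groupcol, days\n"
        ")\n"
        "select * from max_change"
    )


if __name__ == "__main__":
    # Prints the two-lag query shown above (modulo whitespace).
    print(build_lag_query([1, 2]))
```

With `build_lag_query(range(1, 731))` this would produce 730 branches, which is exactly the kind of unwieldy statement I would like to avoid.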
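Ideally, though, I would like to run many different lags (hundreds of them, possibly every value from 1 to 730 days) using only SQL, and to be able to specify the lags in a clean way.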
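One possible direction, sketched here under loud assumptions rather than as a known-good Snowflake solution: replace `lag` with a self-join against a small table of lag offsets. This swaps the technique. `lag(val, n)` counts rows, while the join below matches on `daycol - n`, so the two agree only if every (groupcol, parcol) series has one row per consecutive day, as in the sample data. The sketch runs against an in-memory SQLite database through Python purely so that it is self-contained. The recursive CTE that generates the offsets, the `:max_lag` parameter, and the helper names `load_sample` and `max_change_by_lag` are all assumptions of this example, and none of it has been tried on Snowflake.

```python
import sqlite3

# Same values as the sample dataset: val = day * parcol factor * groupcol factor.
SAMPLE_ROWS = [
    (g, p, day, day * p_factor * g_factor)
    for g, g_factor in (("g1", 1), ("g2", 10))
    for p, p_factor in (("p1", 1), ("p2", 2))
    for day in range(1, 5)
]

SELF_JOIN_QUERY = """
with recursive lags(days) as (
    -- offsets 1 .. :max_lag (SQLite syntax; another engine may need a different generator)
    select 1
    union all
    select days + 1 from lags where days < :max_lag
)
select cur.groupcol, lags.days, max(abs(cur.val - prev.val)) as maxchange
from query_table as cur
cross join lags
join query_table as prev
  on prev.groupcol = cur.groupcol
 and prev.parcol = cur.parcol
 and prev.daycol = cur.daycol - lags.days  -- day offset, not row offset
group by cur.groupcol, lags.days
order by lags.days, cur.groupcol
"""


def load_sample(conn):
    """Create query_table and fill it with the sample rows."""
    conn.execute(
        "create table query_table "
        "(groupcol text, parcol text, daycol integer, val integer)"
    )
    conn.executemany("insert into query_table values (?, ?, ?, ?)", SAMPLE_ROWS)


def max_change_by_lag(conn, max_lag):
    """Return (groupcol, days, maxchange) rows for every lag from 1 to max_lag."""
    return conn.execute(SELF_JOIN_QUERY, {"max_lag": max_lag}).fetchall()


if __name__ == "__main__":
    conn = sqlite3.connect(":memory:")
    load_sample(conn)
    for row in max_change_by_lag(conn, 2):
        print(row)
```

On the sample data with `max_lag = 2` this returns the same four rows as the generated `union all` query. Lags with no matching earlier day simply produce no row for that group, and whether the join performs acceptably for 730 lags on a large table is an open question.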