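Identifying non-consecutive values is always a bit tricky and involves several nested sub-queries (at least I can't come up with a better solution).
The first step is to identify the non-consecutive values for the year:
Step 1) Identify non-consecutive values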
select company,
       profession,
       year,
       -- 1 marks the first row of each run of consecutive years, 0 continues a run
       case
          when row_number() over (partition by company, profession order by year) = 1 or
               year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
          else 0
       end as group_cnt
from qualification
This returns the following result:
 company | profession | year | group_cnt
---------+------------+------+-----------
 Google  | Programmer | 2000 |         1
 Google  | Sales      | 2000 |         1
 Google  | Sales      | 2001 |         0
 Google  | Sales      | 2002 |         0
 Google  | Sales      | 2004 |         1
 Mozilla | Sales      | 2002 |         1
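To make the flagging rule easier to check by hand, here is a minimal Python sketch of the same logic. It is an illustration only, not part of the query: the `rows` list is copied from the result table above, and `flag_group_starts` is a hypothetical helper name. Sorting the tuples stands in for `partition by company, profession order by year`.

```python
# Sample rows (company, profession, year), copied from the result table above.
rows = [
    ("Google", "Programmer", 2000),
    ("Google", "Sales", 2000),
    ("Google", "Sales", 2001),
    ("Google", "Sales", 2002),
    ("Google", "Sales", 2004),
    ("Mozilla", "Sales", 2002),
]

def flag_group_starts(rows):
    """Return (company, profession, year, group_cnt) tuples.

    group_cnt is 1 when the row opens a new run of consecutive years
    within its (company, profession) partition, otherwise 0.
    """
    flagged = []
    prev_key, prev_year = None, None
    for company, profession, year in sorted(rows):
        key = (company, profession)
        # First row of a partition, or a gap of more than one year.
        is_start = key != prev_key or year - prev_year > 1
        flagged.append((company, profession, year, 1 if is_start else 0))
        prev_key, prev_year = key, year
    return flagged

for row in flag_group_starts(rows):
    print(row)
```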
Using the group_cnt value, we can now create a "group ID" for each group of consecutive years:
Step 2) Define group IDs
select company,
       profession,
       year,
       -- running total of the start flags: all rows of one run share the same number
       sum(group_cnt) over (order by company, profession, year) as group_nr
from (
    select company,
           profession,
           year,
           case
              when row_number() over (partition by company, profession order by year) = 1 or
                   year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
              else 0
           end as group_cnt
    from qualification
) t1
This returns the following result:
 company | profession | year | group_nr
---------+------------+------+----------
 Google  | Programmer | 2000 |        1
 Google  | Sales      | 2000 |        2
 Google  | Sales      | 2001 |        2
 Google  | Sales      | 2002 |        2
 Google  | Sales      | 2004 |        3
 Mozilla | Sales      | 2002 |        4
(6 rows)
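The windowed `sum(group_cnt)` is simply a running total over the flags from step 1. As a small illustration (again plain Python, with the flag values copied from the step 1 result), the same numbering falls out of `itertools.accumulate`:

```python
from itertools import accumulate

# group_cnt flags from step 1, in (company, profession, year) order.
group_cnt = [1, 1, 0, 0, 1, 1]

# Running total of the flags, analogous to
# sum(group_cnt) over (order by company, profession, year).
group_nr = list(accumulate(group_cnt))
print(group_nr)  # [1, 2, 2, 2, 3, 4]
```

Each 1 bumps the counter, so every row up to the next flagged row inherits the same group number.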
As you can see, each "group" has its own group_nr, which we can finally use for the aggregation by adding one more derived table:
Step 3) Final query
select company,
       profession,
       -- order by inside the aggregate keeps the years sorted in the array
       array_agg(year order by year) as years
from (
    select company,
           profession,
           year,
           sum(group_cnt) over (order by company, profession, year) as group_nr
    from (
        select company,
               profession,
               year,
               case
                  when row_number() over (partition by company, profession order by year) = 1 or
                       year - lag(year,1,year) over (partition by company, profession order by year) > 1 then 1
                  else 0
               end as group_cnt
        from qualification
    ) t1
) t2
group by company, profession, group_nr
order by company, profession, group_nr
This returns the following result:
 company | profession | years
---------+------------+------------------
 Google  | Programmer | {2000}
 Google  | Sales      | {2000,2001,2002}
 Google  | Sales      | {2004}
 Mozilla | Sales      | {2002}
(4 rows)
If I'm not mistaken, this is exactly what you wanted.
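To sanity-check the final result outside the database, the grouping can be mirrored in a few lines of Python. This is only a sketch under assumptions: the `rows` list is copied from the sample output above, `consecutive_year_groups` is a hypothetical helper name, and the code collapses the flag and running-sum steps into a single pass rather than reproducing the query literally.

```python
from itertools import groupby

# Sample rows (company, profession, year), copied from the result tables above.
rows = [
    ("Google", "Programmer", 2000),
    ("Google", "Sales", 2000),
    ("Google", "Sales", 2001),
    ("Google", "Sales", 2002),
    ("Google", "Sales", 2004),
    ("Mozilla", "Sales", 2002),
]

def consecutive_year_groups(rows):
    """Return (company, profession, years), one entry per run of consecutive years."""
    result = []
    ordered = sorted(rows)
    # Partition by (company, profession), as the window functions do.
    for (company, profession), part in groupby(ordered, key=lambda r: (r[0], r[1])):
        run = []
        for _, _, year in part:
            # A gap of more than one year closes the current run.
            if run and year - run[-1] > 1:
                result.append((company, profession, run))
                run = []
            run.append(year)
        result.append((company, profession, run))
    return result

for company, profession, years in consecutive_year_groups(rows):
    print(company, profession, years)
```

On the sample rows this yields the same four groups as the final query.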