
I have the following data lake dataset, which serves as the source for a Dimension, and I want to migrate the Dimension's historical data from it.

For example:

Primarykey       Checksum     DateFrom     Dateto      ActiveFlag 
  1                  11         01:00       03:00         False
  1                  22         03:00       05:00         False 
  1                  22         05:00       07:00         False
  1                  11         07:00       09:00         False
  1                  11         09:00    12/31/999         TRUE

Note that the datalake table has several columns that are not part of the dimension, so the checksum is being recalculated; rows can show the same checksum value but different datefrom/dateto ranges.

WITH base AS (
SELECT 
   Primary_key,
   checksum,
   FIRST_VALUE(datefrom) OVER (PARTITION BY Primary_key, checksum ORDER BY datefrom) AS Datefrom,
   LAST_VALUE(dateto) OVER (PARTITION BY Primary_key, checksum ORDER BY datefrom) AS Dateto,
   ROW_NUMBER() OVER (PARTITION BY Primary_key, checksum ORDER BY datefrom) AS latest_record 
FROM Datalake.user)
SELECT * FROM base WHERE latest_record = 1

The data comes out as

Primarykey       Checksum     DateFrom     Dateto 
   1              11           01:00         12/31/999 
   1              22           03:00         07:00

But the expected result is

Primarykey       Checksum     DateFrom     Dateto 
   1              11           01:00         03:00 
   1              22           03:00         07:00
   1              11           07:00         12/31/999 

I have tried several approaches in a single query; are there any good suggestions?


3 Answers


The reason you only get two rows is that you have two columns in your partition, Primarykey and checksum, and there are only two combinations of those. The extra row you want has the same Primarykey and checksum (1, 11) as the first row of the expected output.

The one thing I can see in your data that would get you your result is if you include ActiveFlag in your partition.

WITH base AS (
    SELECT 
       primary_key,
       checksum,
       FIRST_VALUE(datefrom) OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS datefrom,
       LAST_VALUE(dateto) OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS dateto,
       ROW_NUMBER() OVER (PARTITION BY primary_key, checksum, active_flag ORDER BY datefrom) AS latest_record 
    FROM Datalake.user
)
SELECT * FROM base WHERE latest_record = 1
answered 2019-11-21T19:14:20.333

Try this code. It should work in both Snowflake and Oracle: whenever the checksum changes in datefrom order, a separate group is created.

SNOWFLAKE:

WITH base AS (
    SELECT
        Primarykey,
        checksum,
        FIRST_VALUE(datefrom) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Datefrom,
        LAST_VALUE(dateto) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Dateto,
        ROW_NUMBER() OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS latest_record
    FROM (
        SELECT
            Primarykey,
            checksum,
            checksum_prev,
            datefrom,
            dateto,
            -- carry forward the group id of the row where the checksum last changed
            LAST_VALUE(CASE WHEN checksum <> checksum_prev THEN group1 END) IGNORE NULLS OVER (
                ORDER BY group1
                ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW
            ) AS checksum_group
        FROM (
            SELECT
                Primarykey,
                checksum,
                datefrom,
                dateto,
                LAG(checksum, 1, 0) OVER (ORDER BY datefrom) AS checksum_prev,
                LPAD(1000 + ROW_NUMBER() OVER (ORDER BY (SELECT NULL)), 4, 0) AS group1
            FROM Datalake.user
        )
    )
)
SELECT * FROM base WHERE latest_record = 1

Oracle:

WITH base AS (
    SELECT
        Primarykey,
        checksum,
        FIRST_VALUE(datefrom) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Datefrom,
        LAST_VALUE(dateto) OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS Dateto,
        ROW_NUMBER() OVER (PARTITION BY Primarykey, checksum, checksum_group ORDER BY datefrom) AS latest_record
    FROM (
        SELECT
            Primarykey,
            checksum,
            checksum_prev,
            datefrom,
            dateto,
            -- carry forward the group id of the row where the checksum last changed
            LAST_VALUE(CASE WHEN checksum <> checksum_prev THEN group1 END) IGNORE NULLS
                OVER (ORDER BY group1 ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS checksum_group
        FROM (
            SELECT
                Primarykey,
                checksum,
                datefrom,
                dateto,
                LAG(checksum, 1, 0) OVER (ORDER BY datefrom) AS checksum_prev,
                LPAD(1000 + ROWNUM, 4, 0) AS group1
            FROM Datalake.user
        )
    )
)
SELECT * FROM base WHERE latest_record = 1
answered 2019-11-21T22:17:50.023

I adjusted the query so that it works on the entire dataset. It was failing on the full data because the primary key was missing. The modified, working query:

(screenshot of the modified query; image not preserved)
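Since the screenshot is gone, the following is only a hedged guess at what the adjustment might look like, assuming "the primary key was missing" refers to the inner window clauses of the previous answer; the real modified query may differ. Only the inner layers are shown, with the outer FIRST_VALUE/LAST_VALUE query unchanged:

-- Hypothetical sketch: inner layers of the previous answer with Primarykey
-- added to every PARTITION BY / ORDER BY, so groups are computed per key.
SELECT
    Primarykey,
    checksum,
    checksum_prev,
    datefrom,
    dateto,
    LAST_VALUE(CASE WHEN checksum <> checksum_prev THEN group1 END) IGNORE NULLS
        OVER (PARTITION BY Primarykey ORDER BY group1
              ROWS BETWEEN UNBOUNDED PRECEDING AND CURRENT ROW) AS checksum_group
FROM (
    SELECT
        Primarykey,
        checksum,
        datefrom,
        dateto,
        LAG(checksum, 1, 0) OVER (PARTITION BY Primarykey ORDER BY datefrom) AS checksum_prev,
        LPAD(1000 + ROW_NUMBER() OVER (PARTITION BY Primarykey ORDER BY datefrom), 4, 0) AS group1
    FROM Datalake.user
)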

answered 2019-11-22T09:52:02.127