0

我在雪花数据仓库中的源数据存在“空白”,即缺失天数,我的要求是用前一行中的数据填充缺失的数据。所以我创建了这个虚拟项目来练习使用完整的外部连接和窗口功能FIRST_VALUE

我的样本数据有两张表:一张叫prev_test,一张叫dim_date

create TABLE prev_test (
    SITE_ID VARCHAR(650),
    SUBSCRIPTION_ID VARCHAR(650),
    ORDER_CREATED DATE,
    ORDER_TYPE VARCHAR(650),
    SUBSCRIPTION_STATUS VARCHAR(650),
    PERIOD_NORMALIZER VARCHAR(650),
    CHANGE_MRR_EVENT_TYPE VARCHAR(650),
    TOTAL integer,
    DAILY_MRR integer
);

INSERT INTO prev_test 
VALUES('AB', '123', '2021-09-17', 'PRORATED', 'ACTIVE', '1M', 'New', 60, 2);
INSERT INTO prev_test 
VALUES('AB', '123', '2021-09-20', 'PRORATED', 'ACTIVE', '1M', 'New', 30, 10);

create TABLE dim_date (
  date_key date
  );
INSERT INTO dim_date 
VALUES('2021-09-17');
INSERT INTO dim_date 
VALUES('2021-09-18');
INSERT INTO dim_date 
VALUES('2021-09-19');
INSERT INTO dim_date 
VALUES('2021-09-20');

当我这样选择我的数据时

SELECT 
CASE WHEN SITE_ID IS NULL THEN FIRST_VALUE(SITE_ID) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE SITE_ID END AS SITE_ID , 
CASE WHEN SUBSCRIPTION_ID IS NULL THEN FIRST_VALUE(SUBSCRIPTION_ID) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE SUBSCRIPTION_ID END AS SUBSCRIPTION_ID , 
CASE WHEN ORDER_CREATED IS NULL THEN dd.DATE_KEY ELSE ORDER_CREATED END AS ORDER_CREATED, 
CASE WHEN ORDER_TYPE IS NULL THEN FIRST_VALUE(ORDER_TYPE) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE ORDER_TYPE END AS ORDER_TYPE , 
CASE WHEN SUBSCRIPTION_STATUS IS NULL THEN FIRST_VALUE(SUBSCRIPTION_STATUS) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE SUBSCRIPTION_STATUS END AS SUBSCRIPTION_STATUS , 
CASE WHEN PERIOD_NORMALIZER IS NULL THEN FIRST_VALUE(PERIOD_NORMALIZER) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE PERIOD_NORMALIZER END AS PERIOD_NORMALIZER , 
CASE WHEN CHANGE_MRR_EVENT_TYPE IS NULL THEN FIRST_VALUE(CHANGE_MRR_EVENT_TYPE) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE CHANGE_MRR_EVENT_TYPE END AS CHANGE_MRR_EVENT_TYPE , 
CASE WHEN TOTAL IS NULL THEN FIRST_VALUE(TOTAL) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE TOTAL END AS TOTAL , 
CASE WHEN DAILY_MRR IS NULL THEN FIRST_VALUE(DAILY_MRR) OVER (ORDER BY ORDER_CREATED ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE DAILY_MRR END AS DAILY_MRR  
FROM prev_test
FULL OUTER JOIN DIM_DATE dd ON ORDER_CREATED = dd.DATE_KEY 
ORDER BY 3 asc

这给出了我想要的输出

但是,一旦我将相同的代码应用于我的生产数据(与我的示例相同的结构),它就根本不起作用。缺少的天数未填写,FIRST_VALUE 不会将数据“重复”到以下记录。没有错误。

这是我的“真实”数据的一个例子


SITE_ID    |SUBSCRIPTION_ID       |ORDER_CREATED|ORDER_TYPE     |SUBSCRIPTION_STATUS|PERIOD_NORMALIZER|CHANGE_MRR_EVENT_TYPE  |TOTAL   |DAILY_MRR     |
L22|JriInfs|   2021-06-02|PRORATED|Active          |1M               |Upgraded From  |  0.0000|  0.000000|
L22|JriInfs|   2021-09-17|PRORATED|Active          |1M               |New             |209.0000|209.000000|
L22|JriInfs|   2021-09-30|PRORATED|Active          |1M               |Changed         |269.0000|269.000000|
L22|JriInfs|   2021-10-08|PRORATED|Active          |1M               |Downgraded From|  0.0000|  0.000000|

这是我想要的输出(** 是“缺失”的日子),我希望这会有所帮助:


SITE_ID    |SUBSCRIPTION_ID       |ORDER_CREATED|ORDER_TYPE     |SUBSCRIPTION_STATUS|PERIOD_NORMALIZER|CHANGE_MRR_EVENT_TYPE  |TOTAL   |DAILY_MRR     |
L22|JriInfs|   2021-06-02|PRORATED|Active          |1M               |Upgraded From  |  0.0000|  0.000000|
**L22|JriInfs|   2021-06-03|PRORATED|Active          |1M               |Upgraded From  |  0.0000|  0.000000|**
…
**L22|JriInfs|   2021-09-16|PRORATED|Active          |1M               |Upgraded From  |  0.0000|  0.000000|**
L22|JriInfs|   2021-09-17|PRORATED|Active          |1M               |New             |209.0000|209.000000|
**L22|JriInfs|   2021-09-18|PRORATED|Active          |1M               |New             |209.0000|209.000000|**
…
**L22|JriInfs|   2021-09-29|PRORATED|Active          |1M               |New             |209.0000|209.000000|**
L22|JriInfs|   2021-09-30|PRORATED|Active          |1M               |Changed         |269.0000|269.000000|
...

我错过了什么?

4

2 回答 2

0

这是我的问题的解决方案


SELECT DISTINCT
CASE 
WHEN srm.SITE_ID IS NULL THEN FIRST_VALUE(srm.SITE_ID) OVER (ORDER BY dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.SITE_ID END AS SITE_ID, 
CASE WHEN srm.SUBSCRIPTION_ID IS NULL THEN FIRST_VALUE(SUBSCRIPTION_ID) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.SUBSCRIPTION_ID END AS SUBSCRIPTION_ID , 
DD.DATE_KEY AS ORDER_CREATED,
CASE WHEN srm.ORDER_TYPE IS NULL THEN FIRST_VALUE(srm.ORDER_TYPE) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.ORDER_TYPE END AS ORDER_TYPE , 
CASE WHEN srm.SUBSCRIPTION_STATUS IS NULL THEN FIRST_VALUE(SUBSCRIPTION_STATUS) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.SUBSCRIPTION_STATUS END AS SUBSCRIPTION_STATUS , 
CASE WHEN srm.PERIOD_NORMALIZER IS NULL THEN FIRST_VALUE(PERIOD_NORMALIZER) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.PERIOD_NORMALIZER END AS PERIOD_NORMALIZER , 
CASE WHEN srm.CHANGE_MRR_EVENT_TYPE IS NULL THEN FIRST_VALUE(CHANGE_MRR_EVENT_TYPE) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.CHANGE_MRR_EVENT_TYPE END AS CHANGE_MRR_EVENT_TYPE , 
CASE WHEN srm.TOTAL IS NULL THEN FIRST_VALUE(TOTAL) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.TOTAL END AS TOTAL , 
CASE WHEN srm.DAILY_MRR IS NULL THEN FIRST_VALUE(DAILY_MRR) OVER (ORDER BY  dd.DATE_KEY ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) ELSE srm.DAILY_MRR END AS DAILY_MRR  
FROM DIM_DATE dd
    LEFT JOIN SUBSCRIPTION_REVENUE_MART srm 
        ON  srm.ORDER_CREATED = dd.DATE_KEY
ORDER BY dd.DATE_KEY;

变化如下:

  • 联接已从 FULL 更改为 LEFT
  • 桌子的顺序被颠倒了
  • 查询不是拉 ORDER_CREATED,而是在 SELECT 子句中拉 DATE_KEY
于 2021-10-21T16:57:09.003 回答
0

您的查询的问题是您基于ORDER_CREATED. 此处不使用转换后的值,而对于缺少的行,该值是NULL,因此结果FIRST_VALUE是 NULL。您应该DATE_KEY改为订购。但是,ordering byDATE_KEY仍然会获取不正确的结果,因为丢失的行总是会填充第一行的值。在您的示例中,如果您有另一个插入

INSERT INTO prev_test 
VALUES('AB', '123', '2021-09-23', 'PRORATED', 'ACTIVE', '1M', 'New', 30, 10);

日期 21 日和 22 日的行将从 17 日(这是不正确的)获取值,而不是从 20 日获取值。

我尝试了以下查询并且它有效。您可以修改它以使用FIRST_VALUE

select
case when site_id is not null then site_id 
else max(site_id) over(partition by c_val order by order_created) end as site_id,
case when SUBSCRIPTION_ID is not null then SUBSCRIPTION_ID
else max(SUBSCRIPTION_ID) over(partition by c_val order by order_created) end as SUBSCRIPTION_ID,
order_created
from
(select 
*,
sum(val)over(order by order_created rows unbounded preceding) as c_val from
(
  select
  site_id,
  subscription_id,
  coalesce(a.order_created,dd.DATE_KEY) as order_created,
  ORDER_TYPE,
  SUBSCRIPTION_STATUS,
  PERIOD_NORMALIZER,
  CHANGE_MRR_EVENT_TYPE,
  TOTAL,
  DAILY_MRR,
  case when a.order_created is null then 0 else 1 end as val
  FROM prev_test a
FULL OUTER JOIN DIM_DATE dd ON a.ORDER_CREATED = dd.DATE_KEY
WHERE  dd.DATE_KEY >=(select min(a.ORDER_CREATED) from prev_test)
)tbl) x

数据库小提琴:http ://sqlfiddle.com/#!18/c7dc8/3/0

编辑:添加条件来过滤日期的起点。

于 2021-10-20T08:22:48.927 回答