4

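As a simplified example, I need to select every instance where a customer's shipping address differs from their previous shipping address. So I have a large table with the following columns: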

purchase_id | cust_id | date | address  | description
-----------------------------------------------------------
 1          | 5       | jan  | address1 | desc1
 2          | 6       | jan  | address2 | desc2
 3          | 5       | feb  | address1 | desc3
 4          | 6       | feb  | address2 | desc4
 5          | 5       | mar  | address3 | desc5
 6          | 5       | mar  | address3 | desc6
 7          | 5       | apr  | address1 | desc7
 8          | 6       | may  | address4 | desc8

Note that a customer can "move back" to a previous address, as customer 5 does in row 7.

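What I want to select (as efficiently as possible, since this is a very large table) is the first row of each "block" in which a customer ships consecutive orders to the same address. In this example, that would be rows 1, 2, 5, 7, and 8. In all other rows, the customer's address is the same as on their previous order.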

So effectively I want to first ORDER BY (cust_id, date), then SELECT purchase_id, cust_id, min(date), address, description.

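But I'm running into trouble because SQL normally requires GROUP BY to be done before ORDER BY. As a result, I can't figure out how to adapt any of the top answers to this question (which I otherwise like very much). It is necessary (at least conceptually) to order by date before grouping or applying an aggregate function such as min(); otherwise I would miss instances like row 7 in my sample table, where the customer "moves back" to a previous address.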

Also note that two customers can share an address, so after ordering by date I effectively need to group by both cust_id and address.

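I'm using Snowflake, which I believe shares most of its commands with recent versions of PostgreSQL and SQL Server (although I'm fairly new to Snowflake, so I'm not entirely sure).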

4 Answers

2

You can do this with the row_number window function:

;with cte as(select *, row_number() over(partition by cust_id, address
                                         order by purchase_id) as rn from table)
select * from cte 
where rn = 1
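
To see what this query returns on the question's sample data, here is a minimal sketch that runs the same ROW_NUMBER logic in SQLite through Python's sqlite3 module rather than in Snowflake. The table name `purchases` and the ISO dates standing in for jan through may are assumptions made for illustration, and it assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

# Sample rows from the question: (purchase_id, cust_id, date, address).
# The concrete dates are illustrative stand-ins for "jan" .. "may".
rows = [
    (1, 5, "2021-01-01", "address1"),
    (2, 6, "2021-01-01", "address2"),
    (3, 5, "2021-02-01", "address1"),
    (4, 6, "2021-02-01", "address2"),
    (5, 5, "2021-03-01", "address3"),
    (6, 5, "2021-03-01", "address3"),
    (7, 5, "2021-04-01", "address1"),
    (8, 6, "2021-05-01", "address4"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (purchase_id INT, cust_id INT, date TEXT, address TEXT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

# Same shape as the query above: number the rows within each
# (cust_id, address) partition and keep the first one.
query = """
WITH cte AS (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY cust_id, address
                            ORDER BY purchase_id) AS rn
  FROM purchases
)
SELECT purchase_id FROM cte WHERE rn = 1 ORDER BY purchase_id
"""
first_per_address = [r[0] for r in conn.execute(query)]
print(first_per_address)
```

On this sample the partition (cust_id = 5, address1) contains purchases 1, 3, and 7, so only purchase 1 is kept from it. If the "move back" in row 7 needs to be reported, the partition has to distinguish the two address1 blocks.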
answered 2016-04-18T04:16:22.257
1

Sorry for the late reply. I meant to respond to this thread a few days ago.

The most "proper" approach I can think of is to use the LAG function.

Take this:

select purchase_id, cust_id, address, 
lag(address, 1) over (partition by cust_id order by purchase_id) prev_address 
from x order by cust_id, purchase_id;
-------------+---------+----------+--------------+
 PURCHASE_ID | CUST_ID | ADDRESS  | PREV_ADDRESS |
-------------+---------+----------+--------------+
 1           | 5       | address1 | [NULL]       |
 3           | 5       | address1 | address1     |
 5           | 5       | address3 | address1     |
 6           | 5       | address3 | address3     |
 7           | 5       | address1 | address3     |
 2           | 6       | address2 | [NULL]       |
 4           | 6       | address2 | address2     |
 8           | 6       | address4 | address2     |
-------------+---------+----------+--------------+

You can then easily detect the rows where the event you describe occurs:

select purchase_id, cust_id, address, prev_address from (
  select purchase_id, cust_id, address, 
  lag(address, 1) over (partition by cust_id order by purchase_id) prev_address 
  from x 
) sub 
where not equal_null(address, prev_address)
order by cust_id, purchase_id;
-------------+---------+----------+--------------+
 PURCHASE_ID | CUST_ID | ADDRESS  | PREV_ADDRESS |
-------------+---------+----------+--------------+
 1           | 5       | address1 | [NULL]       |
 5           | 5       | address3 | address1     |
 7           | 5       | address1 | address3     |
 2           | 6       | address2 | [NULL]       |
 8           | 6       | address4 | address2     |
-------------+---------+----------+--------------+

Note that I use the EQUAL_NULL function to get NULL = NULL semantics.

Also note that the LAG function can be computationally intensive (though its cost is comparable to the ROW_NUMBER approach proposed earlier).
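
For readers without a Snowflake account, the LAG approach can be tried locally. The following is a minimal sketch using SQLite through Python's sqlite3 module, not Snowflake itself. It assumes a table named `purchases` (a hypothetical name), illustrative ISO dates, and an SQLite build with window functions (3.25 or later). Because SQLite has no EQUAL_NULL, the sketch swaps in SQLite's null-safe `IS NOT` operator for `not equal_null(...)`.

```python
import sqlite3

# Sample rows from the question: (purchase_id, cust_id, date, address).
# The concrete dates are illustrative stand-ins for "jan" .. "may".
rows = [
    (1, 5, "2021-01-01", "address1"),
    (2, 6, "2021-01-01", "address2"),
    (3, 5, "2021-02-01", "address1"),
    (4, 6, "2021-02-01", "address2"),
    (5, 5, "2021-03-01", "address3"),
    (6, 5, "2021-03-01", "address3"),
    (7, 5, "2021-04-01", "address1"),
    (8, 6, "2021-05-01", "address4"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (purchase_id INT, cust_id INT, date TEXT, address TEXT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

# LAG fetches the previous address per customer; "IS NOT" compares
# null-safely, so the first row of each customer (prev_address NULL)
# is kept as the start of a block.
query = """
SELECT purchase_id, cust_id, address, prev_address
FROM (
  SELECT purchase_id, cust_id, address,
         LAG(address, 1) OVER (PARTITION BY cust_id
                               ORDER BY purchase_id) AS prev_address
  FROM purchases
) AS sub
WHERE address IS NOT prev_address
ORDER BY cust_id, purchase_id
"""
block_starts = conn.execute(query).fetchall()
for row in block_starts:
    print(row)
```

The rows come back in the same order as the second result table above: purchases 1, 5, and 7 for customer 5, then 2 and 8 for customer 6.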

answered 2016-05-01T06:09:44.140
0

This is probably best solved with a subquery that gets each customer's first purchase, then using IN to filter rows based on that result.

To clarify, purchase_id is an auto-increment column, right? If so, a purchase with a higher purchase_id must have been created later, and the following is sufficient:

SELECT *
FROM purchases
WHERE purchase_id IN (
  SELECT MIN(purchase_id) AS first_purchase_id
  FROM purchases
  GROUP BY cust_id
)

If you only want the first purchase for customers with more than one address, add a HAVING clause to the subquery:

SELECT *
FROM purchases
WHERE purchase_id IN (
  SELECT MIN(purchase_id) AS first_purchase_id
  FROM purchases
  GROUP BY cust_id
  HAVING COUNT(DISTINCT address) > 1
)

Fiddle: http://sqlfiddle.com/#!9/12d75/6

However, if purchase_id is not an auto-increment column, SELECT both cust_id and min(date) in the subquery and use an INNER JOIN on cust_id and min(date):

SELECT *
FROM purchases
INNER JOIN (
  SELECT cust_id, MIN(date) AS min_date
  FROM purchases
  GROUP BY cust_id
  HAVING COUNT(DISTINCT address) > 1
) cust_purchase_date
ON purchases.cust_id = cust_purchase_date.cust_id AND purchases.date = cust_purchase_date.min_date

The first query example will probably be faster, though, so use it if your purchase_id is an auto-increment column.

answered 2016-04-18T03:59:42.013
0

Snowflake has introduced CONDITIONAL_CHANGE_EVENT, which is an ideal fit for the scenario described:

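Returns a window event number for each row within a window partition when the value of the argument expr1 in the current row is different from the value of expr1 in the previous row. The window event number starts from 0 and is incremented by 1 to indicate the number of changes so far within that window.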


Data preparation:

CREATE OR REPLACE TABLE t(purchase_id INT, cust_id INT,
                          date DATE, address TEXT, description TEXT);

INSERT INTO t(purchase_id, cust_id, date, address, description)
VALUES 
 ( 1, 5, '2021-01-01'::DATE ,'address1','desc1')
,( 2, 6, '2021-01-01'::DATE ,'address2','desc2')
,( 3, 5, '2021-02-01'::DATE ,'address1','desc3')
,( 4, 6, '2021-02-01'::DATE ,'address2','desc4')
,( 5, 5, '2021-03-01'::DATE ,'address3','desc5')
,( 6, 5, '2021-03-01'::DATE ,'address3','desc6')
,( 7, 5, '2021-04-01'::DATE ,'address1','desc7')
,( 8, 6, '2021-05-01'::DATE ,'address4','desc8');

Query:

SELECT *, 
 CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
FROM t
ORDER BY purchase_id;

Once the subgroup column CCE has been identified, QUALIFY can be used to find the first row for each (CUST_ID, CCE) pair.

Full query:

WITH cte AS (
 SELECT *,
  CONDITIONAL_CHANGE_EVENT(address) OVER (PARTITION BY CUST_ID ORDER BY DATE) AS CCE
 FROM t
)
SELECT *
FROM  cte
QUALIFY ROW_NUMBER() OVER(PARTITION BY CUST_ID, CCE ORDER BY DATE) = 1
ORDER BY purchase_id;
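
CONDITIONAL_CHANGE_EVENT and QUALIFY are Snowflake features, so the query above will not run on most other engines. As a rough illustration of the same idea, the sketch below emulates the change counter with LAG plus a running SUM and replaces QUALIFY with an outer WHERE, using SQLite through Python's sqlite3 module. It is an approximation under stated assumptions, not Snowflake's implementation: the table name `purchases` is hypothetical, `purchase_id` is added as a tie-breaker for equal dates (an assumption not present in the query above), and NULL addresses are not handled.

```python
import sqlite3

# Same sample data as in the data preparation step:
# (purchase_id, cust_id, date, address).
rows = [
    (1, 5, "2021-01-01", "address1"),
    (2, 6, "2021-01-01", "address2"),
    (3, 5, "2021-02-01", "address1"),
    (4, 6, "2021-02-01", "address2"),
    (5, 5, "2021-03-01", "address3"),
    (6, 5, "2021-03-01", "address3"),
    (7, 5, "2021-04-01", "address1"),
    (8, 6, "2021-05-01", "address4"),
]

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE purchases (purchase_id INT, cust_id INT, date TEXT, address TEXT)"
)
conn.executemany("INSERT INTO purchases VALUES (?, ?, ?, ?)", rows)

query = """
WITH with_prev AS (
  -- Step 1: previous address per customer, in date order.
  SELECT purchase_id, cust_id, date, address,
         LAG(address) OVER (PARTITION BY cust_id
                            ORDER BY date, purchase_id) AS prev_address
  FROM purchases
),
with_cce AS (
  -- Step 2: running count of address changes, starting at 0.
  SELECT *,
         SUM(CASE WHEN prev_address IS NOT NULL
                   AND address <> prev_address THEN 1 ELSE 0 END)
           OVER (PARTITION BY cust_id
                 ORDER BY date, purchase_id) AS cce
  FROM with_prev
),
ranked AS (
  -- Step 3: number the rows inside each (cust_id, cce) block.
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY cust_id, cce
                            ORDER BY date, purchase_id) AS rn
  FROM with_cce
)
SELECT purchase_id, cust_id, address, cce
FROM ranked
WHERE rn = 1          -- stands in for QUALIFY ... = 1
ORDER BY purchase_id
"""
result = conn.execute(query).fetchall()
for row in result:
    print(row)
```

With the tie-breaker in place, purchases 5 and 6 (same customer, same date) are ordered deterministically, so the first row of each block is always the one with the lower `purchase_id`.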

answered 2021-07-31T15:05:25.990