
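I have a table representing the usage of a product, a bit like a log. Product usage is logged as multiple timestamps, and I'd like to represent the same data using time ranges.

It looks like this (PostgreSQL 9.1):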

userid | timestamp          | product
-------------------------------------
001    | 2012-04-23 9:12:05 | foo
001    | 2012-04-23 9:12:07 | foo
001    | 2012-04-23 9:12:09 | foo
001    | 2012-04-23 9:12:11 | barbaz
001    | 2012-04-23 9:12:13 | barbaz
001    | 2012-04-23 9:15:00 | barbaz
001    | 2012-04-23 9:15:01 | barbaz
002    | 2012-04-24 3:41:01 | foo
002    | 2012-04-24 3:41:03 | foo

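I'd like to collapse rows whose time difference from the previous row of the same run is no more than a delta (say, 2 seconds), and get the begin time and the end time, like this: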

userid | begin              | end                | product
----------------------------------------------------------
001    | 2012-04-23 9:12:05 | 2012-04-23 9:12:09 | foo
001    | 2012-04-23 9:12:11 | 2012-04-23 9:12:13 | barbaz
001    | 2012-04-23 9:15:00 | 2012-04-23 9:15:01 | barbaz
002    | 2012-04-24 3:41:01 | 2012-04-24 3:41:03 | foo

Note that consecutive usage of the same product is split into two rows if the usages are more than delta (2 seconds in this example) apart.

create table t (userid int, timestamp timestamp, product text);

insert into t (userid, timestamp, product) values 
(001, '2012-04-23 9:12:05', 'foo'),
(001, '2012-04-23 9:12:07', 'foo'),
(001, '2012-04-23 9:12:09', 'foo'),
(001, '2012-04-23 9:12:11', 'barbaz'),
(001, '2012-04-23 9:12:13', 'barbaz'),
(001, '2012-04-23 9:15:00', 'barbaz'),
(001, '2012-04-23 9:15:01', 'barbaz'),
(002, '2012-04-24 3:41:01', 'foo'),
(002, '2012-04-24 3:41:03', 'foo')
;

1 Answer


Inspired by this answer, given a while back by @a_horse_with_no_name.

WITH groupped_t AS (
SELECT *, sum(grp_id) OVER (ORDER BY userid,product,"timestamp") AS grp_nr
  FROM (SELECT t.*,
          lag("timestamp") OVER
           (PARTITION BY userid,product ORDER BY "timestamp") AS prev_ts,
          CASE WHEN ("timestamp" - lag("timestamp") OVER
            (PARTITION BY userid,product ORDER BY "timestamp")) <= '2s'::interval
          THEN NULL ELSE 1 END AS grp_id
        FROM t) AS g
), periods AS (
SELECT min(gt."timestamp") AS grp_min, max(gt."timestamp") AS grp_max, grp_nr
  FROM groupped_t AS gt
 GROUP BY gt.grp_nr
)
SELECT gt.userid, p.grp_min AS "begin", p.grp_max AS "end", gt.product
  FROM periods p
  JOIN groupped_t gt ON gt.grp_nr = p.grp_nr AND gt."timestamp" = p.grp_min
 ORDER BY gt.userid, p.grp_min;
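  1. The innermost query assigns a group ID based on userid, product and the time difference. I think it is actually safe to PARTITION BY the first two fields only.
  2. groupped_t gives me all the source columns plus an extra running group number. I've used only ORDER BY for the sum() window function here, as I need the group IDs to be unique.
  3. periods is just a helper query for the first and last timestamp in each group.
  4. Finally, I join groupped_t with periods on grp_nr (this is why I need it to be unique) and on the timestamp of the first entry in each group.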
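
To see what steps 1 and 2 produce, the intermediate columns can be inspected on their own. This is only a sketch that reuses the table t from the question; the named window w is my own shorthand for the PARTITION BY clause used above and is not part of the original query.

```sql
-- Show prev_ts, grp_id and the running group number grp_nr for every row.
SELECT g.*,
       -- running sum over the whole table, so group numbers never repeat
       sum(grp_id) OVER (ORDER BY userid, product, "timestamp") AS grp_nr
  FROM (SELECT t.*,
               lag("timestamp") OVER w AS prev_ts,
               -- 1 marks the first row of a run; NULL marks a continuation
               CASE WHEN ("timestamp" - lag("timestamp") OVER w)
                         <= interval '2 seconds'
                    THEN NULL ELSE 1 END AS grp_id
          FROM t
        WINDOW w AS (PARTITION BY userid, product ORDER BY "timestamp")) AS g
 ORDER BY userid, product, "timestamp";
```

The first row of each (userid, product) partition has no previous timestamp, so the comparison yields NULL and the CASE falls through to ELSE 1. Because sum() ignores NULLs, grp_nr only increases on rows that start a new run.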

You can also check this query on SQL Fiddle.
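
As a possible variation, the join back to the detail rows can probably be avoided by carrying userid and product through the GROUP BY. This is a sketch under that assumption rather than the query above; the names flagged, numbered and is_start are mine.

```sql
WITH flagged AS (
    SELECT t.*,
           -- 1 when a row starts a new run, 0 when it continues the previous one
           CASE WHEN ("timestamp" - lag("timestamp") OVER
                        (PARTITION BY userid, product ORDER BY "timestamp"))
                     <= interval '2 seconds'
                THEN 0 ELSE 1 END AS is_start
      FROM t
), numbered AS (
    SELECT flagged.*,
           -- run number, counted separately for each (userid, product)
           sum(is_start) OVER
               (PARTITION BY userid, product ORDER BY "timestamp") AS grp_nr
      FROM flagged
)
SELECT userid,
       min("timestamp") AS "begin",
       max("timestamp") AS "end",
       product
  FROM numbered
 GROUP BY userid, product, grp_nr
 ORDER BY userid, "begin";
```

Here grp_nr only has to be unique within one (userid, product) pair, since both columns are part of the GROUP BY. On the sample data this should give the same four rows as the desired output in the question.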

Note that timestamp, begin and end are reserved words in SQL (end also in PostgreSQL), so you should either avoid them or double-quote them.
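
For illustration, this is what avoiding those words could look like. The table and column names (usage_log, ts, period_begin, period_end) are hypothetical, and the query only demonstrates the naming; it does not split runs by the 2 second gap.

```sql
-- Hypothetical schema: no identifier needs double quotes.
CREATE TEMP TABLE usage_log (userid int, ts timestamp, product text);

INSERT INTO usage_log (userid, ts, product) VALUES
(1, '2012-04-23 9:12:05', 'foo'),
(1, '2012-04-23 9:12:07', 'foo');

-- First and last usage per user and product, with unquoted aliases.
SELECT userid, product,
       min(ts) AS period_begin,
       max(ts) AS period_end
  FROM usage_log
 GROUP BY userid, product
 ORDER BY userid, period_begin;
```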

answered 2012-06-25T15:33:37.043