This is a follow-up question to @Erwin's answer to "Efficient time series querying in Postgres".

To keep things simple, I'll use the same table structure as that question:

id | widget_id | for_date | score |
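
For anyone who wants to reproduce this locally, here is a minimal sketch of a matching table. The column types and constraints are assumptions inferred from the sample data, not taken from the original question.

```sql
-- Hypothetical definition: types and NOT NULL constraints are assumed.
CREATE TABLE score (
    id        integer PRIMARY KEY,
    widget_id integer NOT NULL,   -- which widget the score belongs to
    for_date  date    NOT NULL,   -- day the score applies to
    score     integer NOT NULL
);
```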

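The original question was how to get the score of every widget for every date in a range. If a widget has no entry for a date, the score from that widget's previous entry should be shown. The solution using a cross join and a window function works well as long as all the data falls within the range you are querying. My problem is that I want the previous score even if it lies outside the date range we are looking at.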

Sample data:

INSERT INTO score (id, widget_id, for_date, score) values
(1, 1337, '2012-04-07', 52),
(2, 2222, '2012-05-05', 99),
(3, 1337, '2012-05-07', 112),
(4, 2222, '2012-05-07', 101);

When I query the range from May 5 to May 10, 2012 (i.e. generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')), I would like to get the following:

DAY          WIDGET_ID  SCORE
May, 05 2012    1337    52
May, 05 2012    2222    99
May, 06 2012    1337    52
May, 06 2012    2222    99
May, 07 2012    1337    112
May, 07 2012    2222    101
May, 08 2012    1337    112
May, 08 2012    2222    101
May, 09 2012    1337    112
May, 09 2012    2222    101
May, 10 2012    1337    112
May, 10 2012    2222    101

The best solution so far (also by @Erwin) is:

SELECT a.day, a.widget_id, s.score
FROM  (
   SELECT d.day, w.widget_id
         ,max(s.for_date) OVER (PARTITION BY w.widget_id ORDER BY d.day) AS effective_date
   FROM  (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
   CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
   LEFT   JOIN score s ON s.for_date = d.day AND s.widget_id = w.widget_id
   ) a
LEFT JOIN  score s ON s.for_date = a.effective_date AND s.widget_id = a.widget_id
ORDER BY a.day, a.widget_id;

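But as you can see in this SQL Fiddle, it produces null scores for widget 1337 on the first two days. I would like to see the earlier score of 52 from row 1 there instead.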

Is it possible to do this efficiently?

3 Answers

As you wrote, you should look up the matching score, and if there is a gap, fill it with the most recent earlier score. In SQL that would be:

SELECT d.day, w.widget_id, 
  coalesce(s.score, (select s2.score from score s2
    where s2.for_date<d.day and s2.widget_id=w.widget_id order by s2.for_date desc limit 1)) as score
from (select distinct widget_id FROM score) AS w
cross join (SELECT generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date AS day) d
left join score s ON (s.for_date = d.day AND s.widget_id = w.widget_id)
order by d.day, w.widget_id;

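Here coalesce stands for "if there is a gap": it returns the joined score when one exists for that day and falls back to the subquery otherwise.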
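
To illustrate the fallback in isolation, here is a tiny sketch with literal values taken from the sample data. The column aliases are made up for the illustration.

```sql
SELECT coalesce(NULL, 52) AS gap_filled,    -- first argument is NULL, so the fallback 52 is returned
       coalesce(112, 52)  AS direct_match;  -- first argument is non-null, so 112 is returned
```

In the query above, the first argument is s.score from the left join and the second is the correlated subquery, which is only needed for days without a matching row.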

answered 2013-10-18T06:25:21.017

You can use the distinct on syntax in PostgreSQL:

with cte_d as (
    select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
), cte_w as (
    select distinct widget_id from score
)
select distinct on (d.day, w.widget_id)
    d.day, w.widget_id, s.score
from cte_d as d
    cross join cte_w as w
    left outer join score as s on s.widget_id = w.widget_id and s.for_date <= d.day
order by d.day, w.widget_id, s.for_date desc;

Or get the max date with a subquery:

with cte_d as (
    select generate_series('2012-05-05'::date, '2012-05-10'::date, '1d')::date as day
), cte_w as (
    select distinct widget_id from score
)
select
    d.day, w.widget_id, s.score
from cte_d as d
    cross join cte_w as w
    left outer join score as s on s.widget_id = w.widget_id
where
    exists (
        select 1
        from score as tt
        where tt.widget_id = w.widget_id and tt.for_date <= d.day
        having max(tt.for_date) = s.for_date
    )
order by d.day, w.widget_id;

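Performance really depends on the indexes you have on the table (ideally one on (widget_id, for_date), unique if possible). I think the second query will be more efficient if you have many rows per widget_id, but you have to test that on your data.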
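
A sketch of such an index, assuming there is never more than one score per widget per day (otherwise drop UNIQUE). The index name is made up.

```sql
-- Assumes at most one row per (widget_id, for_date); the name is hypothetical.
CREATE UNIQUE INDEX score_widget_date_uidx ON score (widget_id, for_date);
```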

>> sql fiddle demo<<

answered 2013-10-18T06:50:47.723

As @Roman mentioned, DISTINCT ON can solve this. Details are in this related answer:

Subqueries are generally a bit faster than CTEs, though:

SELECT DISTINCT ON (d.day, w.widget_id)
       d.day, w.widget_id, s.score
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') d(day)
CROSS  JOIN (SELECT DISTINCT widget_id FROM score) AS w
LEFT   JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
ORDER  BY d.day, w.widget_id, s.for_date DESC;

You can use a set-returning function like a table in the FROM list.
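
A minimal sketch of just that part, with the function aliased as a table d that has a single column day. As far as I know, generate_series called with date bounds and an interval step yields timestamps rather than dates, so the sketch casts back to date for display.

```sql
SELECT d.day::date AS day   -- cast because the series is returned as timestamps
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') AS d(day);
```

This should return one row per day from 2012-05-05 through 2012-05-10.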

SQL Fiddle

A multicolumn index should be the key to performance:

CREATE INDEX score_multi_idx ON score (widget_id, for_date, score)

The third column score is only included to make it a covering index in Postgres 9.2 or later. You would not include it in earlier versions.
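
One way to check whether the index is actually used is to look at the plan for a single "latest score on or before a day" lookup. This is only a sketch with values from the sample data. On a tiny table the planner may well prefer a sequential scan, so test with realistic volumes.

```sql
-- Look for "Index Only Scan" in the output; widget and date are sample values.
EXPLAIN (ANALYZE, BUFFERS)
SELECT s.score
FROM   score s
WHERE  s.widget_id = 1337
AND    s.for_date <= '2012-05-05'
ORDER  BY s.for_date DESC
LIMIT  1;
```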

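Of course, if you have many widgets and a wide range of days, the CROSS JOIN produces a lot of rows, and every one of them has a price tag. Select only the widgets and days you actually need.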
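
For example, instead of cross joining every distinct widget_id, you could supply just the widgets of interest. This is a sketch: the VALUES list is a hypothetical selection and the rest is the query from above.

```sql
SELECT DISTINCT ON (d.day, w.widget_id)
       d.day, w.widget_id, s.score
FROM   generate_series('2012-05-05'::date, '2012-05-10'::date, '1d') d(day)
CROSS  JOIN (VALUES (1337), (2222)) AS w(widget_id)  -- hypothetical list of wanted widgets
LEFT   JOIN score s ON s.widget_id = w.widget_id AND s.for_date <= d.day
ORDER  BY d.day, w.widget_id, s.for_date DESC;
```

Narrowing the bounds passed to generate_series limits the days in the same way.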

answered 2013-10-18T15:00:35.053