sql - Oracle: Finding the previous record in a ranked list of forecasts

Question

Hi, I am faced with a difficult problem:

I have a table (Oracle 9i) of weather forecasts (millions of records), which is made up as follows:

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
varchar (pk) datetime (pk)   integer (pk)        datetime (pk)      integer

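Where:

stationid refers to one of the many weather stations that may create a forecast;
forecastdate refers to the date the forecast is for (date only, not time);
forecastinterval refers to the hour within forecastdate that the forecast is for (0 - 23);
forecastcreated refers to the time the forecast was made, which can be many days beforehand;
forecastvalue refers to the actual value of the forecast (as the name implies).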

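For a given stationid and a given forecastdate and forecastinterval pair, I need to find the records where the forecastvalue increases by more than a nominal number (say 100) compared with the previous forecast. Here is a table that shows the condition: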

stationid    forecastdate    forecastinterval    forecastcreated    forecastvalue
---------------------------------------------------------------------------------
'stationa'   13-dec-09       10                  10-dec-09 04:50:10  0
'stationa'   13-dec-09       10                  10-dec-09 17:06:13  0
'stationa'   13-dec-09       10                  12-dec-09 05:20:50  300
'stationa'   13-dec-09       10                  13-dec-09 09:20:50  300

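In the above scenario I would like to pull out the third record. This is the record where the forecast value has increased by a nominal (say 100) amount.

The task is proving very difficult because of the sheer size of the table (millions of records): the query takes so long to complete that it has, in fact, never returned.

Here is my attempt so far to get these values: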

select
    wtr.stationid,
    wtr.forecastcreated,
    wtr.forecastvalue,
    (wtr.forecastdate + wtr.forecastinterval / 24) fcst_date
from
    (select inner.*,
            rank() over (partition by stationid, 
                                   (inner.forecastdate + inner.forecastinterval),
                                   inner.forecastcreated
                         order by stationid, 
                                  (inner.forecastdate + inner.forecastinterval) asc,
                                  inner.forecastcreated asc
            ) rk
      from weathertable inner) wtr 
      where
      wtr.forecastvalue - 100 > (
                     select lastvalue
                      from (select y.*,
                            rank() over (partition by stationid, 
                                            (forecastdate + forecastinterval),
                                            forecastcreated
                                         order by stationid, 
                                           (forecastdate + forecastinterval) asc,
                                           forecastcreated asc) rk
                             from weathertable y
                            ) z
                       where z.stationid = wtr.stationid
                             and z.forecastdate = wtr.forecastdate                                                   
                             and (z.forecastinterval =    
                                         wtr.forecastinterval)
/* here is where i try to get the 'previous' forecast value.*/
                             and wtr.rk = z.rk + 1)

score 1 · Accepted Answer

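Rexem's suggestion of using LAG() is the right approach, but we need to use a partitioning clause. This becomes clear once we add rows for different intervals and different stations...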

SQL> select * from t
  2  /

STATIONID  FORECASTDATE INTERVAL FORECASTCREATED     FORECASTVALUE
---------- ------------ -------- ------------------- -------------
stationa   13-12-2009         10 10-12-2009 04:50:10             0
stationa   13-12-2009         10 10-12-2009 17:06:13             0
stationa   13-12-2009         10 12-12-2009 05:20:50           300
stationa   13-12-2009         10 13-12-2009 09:20:50           300
stationa   13-12-2009         11 13-12-2009 09:20:50           400
stationb   13-12-2009         11 13-12-2009 09:20:50           500

6 rows selected.

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (ORDER BY t.forecastcreated) as difference
 12            FROM t) v
 13   WHERE v.difference >= 100
 14  /

STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00
stationa   13-12-2009 09:20:50           400 13-12-2009 11:00:00
stationb   13-12-2009 09:20:50           500 13-12-2009 11:00:00

SQL>

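To eliminate the false positives we partition the LAG() by STATIONID, FORECASTDATE and FORECASTINTERVAL. Note that the following relies on the inner query returning NULL for the first row of each partition window: there is no previous row to compare against, and because a comparison with NULL is never true, those rows are filtered out by the WHERE clause.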

SQL> SELECT v.stationid,
  2         v.forecastcreated,
  3         v.forecastvalue,
  4         (v.forecastdate + v.forecastinterval / 24) fcst_date
  5    FROM (SELECT t.stationid,
  6                 t.forecastdate,
  7                 t.forecastinterval,
  8                 t.forecastcreated,
  9                 t.forecastvalue,
 10                 t.forecastvalue - LAG(t.forecastvalue, 1)
 11                      OVER (PARTITION BY t.stationid
 12                                         , t.forecastdate
 13                                         , t.forecastinterval
 14                            ORDER BY t.forecastcreated) as difference
 15            FROM t) v
 16   WHERE v.difference >= 100
 17  /

STATIONID  FORECASTCREATED     FORECASTVALUE FCST_DATE
---------- ------------------- ------------- -------------------
stationa   12-12-2009 05:20:50           300 13-12-2009 10:00:00

SQL>
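
To see why the partitioned version drops the false positives, it can help to run the inner query on its own. This is only a sketch against the same sample table t used above; the ORDER BY at the end is added purely to make the output easier to read.

```sql
-- Inner query only: shows the DIFFERENCE column before any filtering.
-- LAG() has no previous row for the first forecast in each
-- (stationid, forecastdate, forecastinterval) partition, so it returns NULL there.
SELECT t.stationid,
       t.forecastinterval,
       t.forecastcreated,
       t.forecastvalue,
       t.forecastvalue - LAG(t.forecastvalue, 1)
            OVER (PARTITION BY t.stationid
                             , t.forecastdate
                             , t.forecastinterval
                  ORDER BY t.forecastcreated) AS difference
  FROM t
 ORDER BY t.stationid, t.forecastdate, t.forecastinterval, t.forecastcreated;
```

With the six sample rows, the first row of each of the three partitions should show a NULL difference, and the remaining rows for stationa interval 10 should show 0, 300 and 0, so only the 300 row passes the >= 100 filter.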

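Dealing with large amounts of data

You describe your table as containing many hundreds of millions of rows. Such huge tables are like black holes: they have a different physics. There are various potential approaches, depending on your needs, timescales, finances, database version and edition, and any other usage of your system's data. That is a more-than-five-minute answer.

But here is the five-minute answer anyway.

Assuming your table is the live table, it is probably populated by adding forecasts as they occur, which is basically an append operation. This means the forecasts for any given station are scattered throughout the table. Consequently, an index on just STATIONID, or even on FORECASTDATE, would have a poor clustering factor.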

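On that assumption, the first thing I would suggest you try is building an index on (STATIONID, FORECASTDATE, FORECASTINTERVAL, FORECASTCREATED, FORECASTVALUE). This will take some time (and disk space) to build, but it ought to speed up your subsequent queries considerably, because it contains all the columns needed to satisfy the query with an INDEX RANGE SCAN without touching the table at all.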
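
As a rough sketch of that suggestion, the statements below build the covering index and then run the LAG() query for a single station. The table name weathertable comes from the question; the index name weathertable_fcst_ix is made up for illustration, and storage or parallel build options are deliberately left out. Whether the optimizer actually chooses an INDEX RANGE SCAN depends on statistics and on the predicates in the query, so check the execution plan rather than assuming it.

```sql
-- Hypothetical index name. Column order: the three PARTITION BY columns first,
-- then the ORDER BY column, then the value, so the query can be answered
-- from the index alone.
CREATE INDEX weathertable_fcst_ix
    ON weathertable (stationid,
                     forecastdate,
                     forecastinterval,
                     forecastcreated,
                     forecastvalue);

-- Filtering on the leading index column (stationid) inside the inner query
-- removes whole partitions, so the LAG() results for the remaining rows
-- are unchanged. 'stationa' is just the sample station from the question.
SELECT v.stationid,
       v.forecastcreated,
       v.forecastvalue,
       (v.forecastdate + v.forecastinterval / 24) fcst_date
  FROM (SELECT t.stationid,
               t.forecastdate,
               t.forecastinterval,
               t.forecastcreated,
               t.forecastvalue,
               t.forecastvalue - LAG(t.forecastvalue, 1)
                    OVER (PARTITION BY t.stationid
                                     , t.forecastdate
                                     , t.forecastinterval
                          ORDER BY t.forecastcreated) AS difference
          FROM weathertable t
         WHERE t.stationid = 'stationa') v
 WHERE v.difference >= 100;
```

Processing one station (or a small batch of stations) at a time is one way to keep each run bounded on a table of this size; it is a design choice, not something the original answer prescribes.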
