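I am looking for a way to identify similar records in an Oracle 11.2 table within a 3-second sliding window. About 500K rows are inserted into the table every 24 hours.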

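Requirements:

  1. Records count as similar when UTL_MATCH.JARO_WINKLER_SIMILARITY gives them a score of at least 88%
  2. If at least one similar record exists within the three-second window, the FLAG column should be updated for the given record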
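
To make the threshold in requirement 1 concrete, the score for a single pair of strings can be checked directly. This is only a sketch: the two literals are illustrative values, and UTL_MATCH.JARO_WINKLER_SIMILARITY returns an integer from 0 to 100, so the comparison is against 88 rather than 0.88.

```sql
-- Score one pair of addresses and show whether it clears the 88 threshold.
-- The literals are illustrative; any two VARCHAR2 values can be used.
SELECT utl_match.jaro_winkler_similarity('12test56', '12test88') AS score,
       CASE
         WHEN utl_match.jaro_winkler_similarity('12test56', '12test88') >= 88
         THEN 'similar'
         ELSE 'not similar'
       END AS verdict
FROM   dual;
```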

Table definition:

CREATE TABLE ADDR_TAB
  ( DT DATE NOT NULL,
    ADDR VARCHAR2(200) NOT NULL,
    FLAG INT
  );
CREATE INDEX ADDR_DATE_IDX ON ADDR_TAB(DT);

Sample data:

insert into addr_tab values (to_date('03-OCT-13 04.36.57 PM','DD-MON-RR HH.MI.SS AM'),'test',null);
insert into addr_tab values (to_date('03-OCT-13 04.36.57 PM','DD-MON-RR HH.MI.SS AM'),'test123',null);
insert into addr_tab values (to_date('03-OCT-13 04.36.58 PM','DD-MON-RR HH.MI.SS AM'),'2test2',null);
insert into addr_tab values (to_date('03-OCT-13 04.36.58 PM','DD-MON-RR HH.MI.SS AM'),'12test',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.00 PM','DD-MON-RR HH.MI.SS AM'),'12test',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.02 PM','DD-MON-RR HH.MI.SS AM'),'12test',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.03 PM','DD-MON-RR HH.MI.SS AM'),'1test87',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.03 PM','DD-MON-RR HH.MI.SS AM'),'12test',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.03 PM','DD-MON-RR HH.MI.SS AM'),'12test56',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.04 PM','DD-MON-RR HH.MI.SS AM'),'12test88',null);
insert into addr_tab values (to_date('03-OCT-13 04.37.05 PM','DD-MON-RR HH.MI.SS AM'),'12test56',null);

SQLFiddle: http://sqlfiddle.com/#!4/1b53f/1

1 Answer

Try this:

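-- Return every row that has at least one similar row within +/- 3 seconds.
WITH basedata 
     AS (SELECT * 
         FROM   addr_tab 
         ORDER  BY dt) 
SELECT * 
FROM   basedata a 
WHERE  EXISTS (SELECT 1 
               FROM   basedata b 
               -- 3 / 86400 is three seconds expressed as a fraction of a day
               WHERE  a.dt <= b.dt + 3 / 86400 
                      AND a.dt >= b.dt - 3 / 86400 
                      -- do not compare a row with itself
                      AND a.ROWID <> b.ROWID 
                      -- JARO_WINKLER returns 0..1, so scale it to a percentage
                      AND utl_match.jaro_winkler(a.addr, b.addr) * 100 >= 88);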

SQLFiddle: http://sqlfiddle.com/#!4/1b53f/22
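
The query above only selects the matching rows, while the question also asks for FLAG to be updated. The same EXISTS condition could be moved into an UPDATE, roughly as sketched below. Two assumptions here are not from the question: FLAG = 1 is used to mean "a similar record exists", and the score is computed with UTL_MATCH.JARO_WINKLER_SIMILARITY (integer 0 to 100) as the question's requirement names it. This has not been tuned for the 500K-row volume.

```sql
-- Sketch: flag each row that has a similar neighbour within +/- 3 seconds.
-- Assumption: FLAG = 1 marks a match; the question does not say which value to store.
UPDATE addr_tab a
SET    a.flag = 1
WHERE  EXISTS (SELECT 1
               FROM   addr_tab b
               -- range predicate on b.dt, so the index ADDR_DATE_IDX can be considered
               WHERE  b.dt BETWEEN a.dt - 3 / 86400 AND a.dt + 3 / 86400
                      -- skip the row itself
                      AND b.ROWID <> a.ROWID
                      AND utl_match.jaro_winkler_similarity(a.addr, b.addr) >= 88);
```

Writing the time window as a BETWEEN on b.dt keeps the expression on the outer row's side, which gives the optimizer the option of an index range scan on DT before the more expensive similarity function is evaluated.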

answered 2013-10-16T01:13:40.070