I am using Oracle 11g, and I want to compare two tables to find the records that match between them.

Example:

Table 1        Table 2

George         Micheal
Michael        Paul

Between them, the records "Michael" and "Micheal" match, so they are good records.

To check whether two records match, I use the Oracle function utl_match.edit_distance_similarity.

I tried the code below, but I ran into a performance problem (it is too slow):

SELECT * 
FROM table1
JOIN table2
ON utl_match.edit_distance_similarity(table1.name, table2.name) > 75;

Is there a better solution?

Thanks

2 Answers

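This is a hard problem. In general, a join on a function like this results in a nested loops join and is slow. You could use SOUNDEX() to get "close" matches and then use the character-distance function for the final filtering. This may not work for your problem, but it might.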

Although I am not a big fan of the function, you might find that soundex() works for your purposes.

The idea is to add an index on this value:

create index idx_table1_soundexname on table1(soundex(name));
create index idx_table2_soundexname on table2(soundex(name));

You would then query like this:

SELECT * 
FROM table1 t1 JOIN
     table2 t2
     ON soundex(t1.name) = soundex(t2.name)
WHERE utl_match.edit_distance_similarity(t1.name, t2.name) > 75;

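The idea is that Oracle will use the index to get the names that are "close", and then use the edit distance to narrow those down to the better matches. This may not work for your problem. It is just an idea that might work.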
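
Before relying on this, it may be worth checking two things: that the names you expect to match really share a SOUNDEX code, and that the optimizer actually takes advantage of the equality condition. The following is a minimal sketch, assuming the table1/table2 tables and the indexes above exist; the column aliases code1 and code2 are only illustrative.

```sql
-- The two example names share a SOUNDEX code (both are M240),
-- so the equality join pairs them before the edit-distance filter runs.
select soundex('Michael') as code1,
       soundex('Micheal') as code2
  from dual;

-- Look at the plan to see whether the equality on soundex(name)
-- lets Oracle avoid evaluating the function for every pair of rows.
explain plan for
select *
  from table1 t1
  join table2 t2
    on soundex(t1.name) = soundex(t2.name)
 where utl_match.edit_distance_similarity(t1.name, t2.name) > 75;

select * from table(dbms_xplan.display);
```

Keep in mind that SOUNDEX keeps the first letter of the name, so two names that differ in their first letter get different codes and are never compared, however close their edit distance is.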

Answered 2017-02-02T11:49:08.853

If the name values in your tables table1 and table2 contain a lot of redundancy, this could be a solution:

-- Test data set

select count(*) from table1;
--> 10.000

select count(*) from table2;
--> 10.000

select count(distinct(name)) from table1;
--> ~ 2500

select count(distinct(name)) from table2;
--> ~ 2500

/* a) Join with function compare */

select table1.name, table2.name
  from table1, table2
 where utl_match.edit_distance_similarity(table1.name, table2.name) > 35;

/*

--------------------------------------------------------------------------------
| Id  | Operation            | Name   | Rows    | Bytes     | Cost  | Time     |
--------------------------------------------------------------------------------
|   0 | SELECT STATEMENT     |        | 5000000 | 270000000 | 37364 | 00:09:21 |
|   1 |   NESTED LOOPS       |        | 5000000 | 270000000 | 37364 | 00:09:21 |
|   2 |    TABLE ACCESS FULL | TABLE1 |   10000 |    270000 |     5 | 00:00:01 |
| * 3 |    TABLE ACCESS FULL | TABLE2 |     500 |     13500 |     4 | 00:00:01 |
--------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 3 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("TABLE1"."NAME","TABLE2"."NAME")>35)


Note
-----
- dynamic sampling used for this statement

*/

/* b) Join with function, only distinct values */

-- A Set of all existing names (in table1 and table2)
 with names as
 (select name from table1 union select name from table2),

-- Compare only once because utl_match.edit_distance_similarity(name1, name2) = utl_match.edit_distance_similarity(name2, name1)
 table_cmp(name1, name2) as
 (select n1.name, n2.name
          from names n1
          join names n2
            on n1.name <= n2.name
           and utl_match.edit_distance_similarity(n1.name, n2.name) > 35)

  select t1.*, t2.*
          from table_cmp c
          join table1 t1
            on t1.name = c.name1
          join table2 t2
            on t2.name = c.name2
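        -- Note: a pair with name1 = name2 (an exact match) satisfies both branches,
        -- so those rows are returned twice; adding "where c.name1 <> c.name2"
        -- to one branch would avoid that.
        union all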
        select t1.*, t2.*
          from table_cmp c
          join table1 t1
            on t1.name = c.name2
          join table2 t2
            on t2.name = c.name1;


/*

--------------------------------------------------------------------------------------------------------------
| Id   | Operation                   | Name                        | Rows     | Bytes      | Cost | Time     |
--------------------------------------------------------------------------------------------------------------
|    0 | SELECT STATEMENT            |                             | 30469950 | 3290754600 | 2495 | 00:00:38 |
|    1 |   TEMP TABLE TRANSFORMATION |                             |          |            |      |          |
|    2 |    LOAD AS SELECT           | SYS_TEMP_0FD9D663E_B39FC2B6 |          |            |      |          |
|    3 |     SORT UNIQUE             |                             |    20000 |     540000 |   12 | 00:00:01 |
|    4 |      UNION-ALL              |                             |          |            |      |          |
|    5 |       TABLE ACCESS FULL     | TABLE1                      |    10000 |     270000 |    5 | 00:00:01 |
|    6 |       TABLE ACCESS FULL     | TABLE2                      |    10000 |     270000 |    5 | 00:00:01 |
|    7 |    LOAD AS SELECT           | SYS_TEMP_0FD9D663F_B39FC2B6 |          |            |      |          |
|    8 |     MERGE JOIN              |                             |  1000000 |   54000000 |   62 | 00:00:01 |
|    9 |      SORT JOIN              |                             |    20000 |     540000 |    3 | 00:00:01 |
|   10 |       VIEW                  |                             |    20000 |     540000 |    2 | 00:00:01 |
|   11 |        TABLE ACCESS FULL    | SYS_TEMP_0FD9D663E_B39FC2B6 |    20000 |     540000 |    2 | 00:00:01 |
| * 12 |      FILTER                 |                             |          |            |      |          |
| * 13 |       SORT JOIN             |                             |    20000 |     540000 |    3 | 00:00:01 |
|   14 |        VIEW                 |                             |    20000 |     540000 |    2 | 00:00:01 |
|   15 |         TABLE ACCESS FULL   | SYS_TEMP_0FD9D663E_B39FC2B6 |    20000 |     540000 |    2 | 00:00:01 |
|   16 |    UNION-ALL                |                             |          |            |      |          |
| * 17 |     HASH JOIN               |                             | 15234975 | 1645377300 | 1248 | 00:00:19 |
|   18 |      TABLE ACCESS FULL      | TABLE2                      |    10000 |     270000 |    5 | 00:00:01 |
| * 19 |      HASH JOIN              |                             |  3903201 |  316159281 | 1200 | 00:00:18 |
|   20 |       TABLE ACCESS FULL     | TABLE1                      |    10000 |     270000 |    5 | 00:00:01 |
|   21 |       VIEW                  |                             |  1000000 |   54000000 | 1183 | 00:00:18 |
|   22 |        TABLE ACCESS FULL    | SYS_TEMP_0FD9D663F_B39FC2B6 |  1000000 |   54000000 | 1183 | 00:00:18 |
| * 23 |     HASH JOIN               |                             | 15234975 | 1645377300 | 1248 | 00:00:19 |
|   24 |      TABLE ACCESS FULL      | TABLE2                      |    10000 |     270000 |    5 | 00:00:01 |
| * 25 |      HASH JOIN              |                             |  3903201 |  316159281 | 1200 | 00:00:18 |
|   26 |       TABLE ACCESS FULL     | TABLE1                      |    10000 |     270000 |    5 | 00:00:01 |
|   27 |       VIEW                  |                             |  1000000 |   54000000 | 1183 | 00:00:18 |
|   28 |        TABLE ACCESS FULL    | SYS_TEMP_0FD9D663F_B39FC2B6 |  1000000 |   54000000 | 1183 | 00:00:18 |
--------------------------------------------------------------------------------------------------------------

Predicate Information (identified by operation id):
------------------------------------------
* 12 - filter("UTL_MATCH"."EDIT_DISTANCE_SIMILARITY"("N1"."NAME","N2"."NAME")>35)
* 13 - access("N1"."NAME"<="N2"."NAME")
* 13 - filter("N1"."NAME"<="N2"."NAME")
* 17 - access("T2"."NAME"="C"."NAME2")
* 19 - access("T1"."NAME"="C"."NAME1")
* 23 - access("T2"."NAME"="C"."NAME1")
* 25 - access("T1"."NAME"="C"."NAME2")


Note
-----
- dynamic sampling used for this statement

*/
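
Whether approach b) pays off depends on how much redundancy the names have. As a rough check, the number of function calls in each approach can be estimated up front. This is only a sketch under the assumptions of the queries above: approach a) calls the function once per row pair, and approach b) calls it once per pair of distinct names with name1 <= name2. The aliases t1_rows, t2_rows, name_count, calls_full_join and calls_distinct_names are illustrative.

```sql
-- Estimate how often utl_match.edit_distance_similarity is evaluated.
select t1_rows * t2_rows                 as calls_full_join,      -- approach a)
       name_count * (name_count + 1) / 2 as calls_distinct_names  -- approach b)
  from (select (select count(*) from table1) as t1_rows,
               (select count(*) from table2) as t2_rows,
               (select count(*)
                  from (select name from table1
                        union
                        select name from table2)) as name_count
          from dual);
```

For the test data set above, that is 10,000 * 10,000 = 100,000,000 calls for a), against at most about 5,000 * 5,001 / 2 = 12,502,500 calls for b), since the two tables together hold at most roughly 5,000 distinct names.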
Answered 2017-02-02T11:55:57.150