
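I am trying to join a set of county names in one table to the county names in another table. The problem is that the county names are not standardized in either table. The two tables hold different numbers of counties, and the names do not always follow a similar pattern. For example, the county "SAINT JOHNS" in "Table A" may appear as "ST JOHNS" in "Table B". We cannot predict a common pattern between them.

This means we cannot use an "equals" (=) condition when joining. So I am trying to join them using the JARO_WINKLER_SIMILARITY function in Oracle. My left outer join condition looks like this: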

Table_A.State = Table_B.State 
AND UTL_MATCH.JARO_WINKLER_SIMILARITY(Table_A.County_Name,Table_B.County_Name)>=80

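After some testing on the results I settled on a threshold of 80, which seemed to be optimal. The problem is that I get a set of "false positives" when joining. For example, if there are counties with similar names in the same state (such as "BARRY" and "BAY"), they are matched whenever the measure is >=80. This produces an inaccurate joined data set. Can anyone suggest a workaround?

Thanks, DAV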


2 Answers


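"Can you help me build a query that looks up Table_A for each record in Table B/C/D and matches the county name in A that has the highest similarity >=80?"

Oracle Setup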

CREATE TABLE official_words ( word ) AS
  SELECT 'SAINT JOHNS' FROM DUAL UNION ALL
  SELECT 'MONTGOMERY' FROM DUAL UNION ALL
  SELECT 'MONROE' FROM DUAL UNION ALL
  SELECT 'SAINT JAMES' FROM DUAL UNION ALL
  SELECT 'BOTANY BAY' FROM DUAL;

CREATE TABLE words_to_match ( word ) AS
  SELECT 'SAINT JOHN' FROM DUAL UNION ALL
  SELECT 'ST JAMES' FROM DUAL UNION ALL
  SELECT 'MONTGOMERY BAY' FROM DUAL UNION ALL
  SELECT 'MONROE ST' FROM DUAL;

Query

SELECT *
FROM   (
  SELECT wtm.word,
         ow.word AS official_word,
         UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) AS similarity,
         ROW_NUMBER() OVER ( PARTITION BY wtm.word ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word ) DESC ) AS rn
  FROM   words_to_match wtm
         INNER JOIN
         official_words ow
         ON ( UTL_MATCH.JARO_WINKLER_SIMILARITY( wtm.word, ow.word )>=80 )
)
WHERE rn = 1;

Output

WORD           OFFICIAL_WO SIMILARITY         RN
-------------- ----------- ---------- ----------
MONROE ST      MONROE              93          1
MONTGOMERY BAY MONTGOMERY          94          1
SAINT JOHN     SAINT JOHNS         98          1
ST JAMES       SAINT JAMES         80          1
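
The query above compares plain word lists. As a rough sketch of how the same pattern might be applied to the question's tables, the version below adds the state condition and the left outer join from the question. The sample rows, the state codes, and the table_a/table_b column layout are assumptions made for illustration, not data from the question.

```sql
-- Hypothetical sample data standing in for the real Table_A and Table_B
WITH table_a ( state, county_name ) AS (
  SELECT 'FL', 'SAINT JOHNS' FROM DUAL UNION ALL
  SELECT 'MI', 'BARRY' FROM DUAL
),
table_b ( state, county_name ) AS (
  SELECT 'FL', 'ST JOHNS' FROM DUAL UNION ALL
  SELECT 'MI', 'BAY' FROM DUAL UNION ALL
  SELECT 'MI', 'BARRY' FROM DUAL
)
SELECT state, county_name, matched_name, similarity
FROM   (
  SELECT a.state,
         a.county_name,
         b.county_name AS matched_name,
         UTL_MATCH.JARO_WINKLER_SIMILARITY( a.county_name, b.county_name ) AS similarity,
         -- Rank the candidates separately for every (state, county) in table_a
         ROW_NUMBER() OVER ( PARTITION BY a.state, a.county_name
                             ORDER BY UTL_MATCH.JARO_WINKLER_SIMILARITY( a.county_name, b.county_name ) DESC ) AS rn
  FROM   table_a a
         LEFT OUTER JOIN
         table_b b
         -- Only counties in the same state are candidates
         ON (     b.state = a.state
              AND UTL_MATCH.JARO_WINKLER_SIMILARITY( a.county_name, b.county_name ) >= 80 )
)
WHERE rn = 1;
```

Because only the best-ranked candidate per county is kept, an exact match such as BARRY/BARRY should win over a weaker candidate such as BAY even when both clear the threshold. With the left outer join, a county that has no candidate at or above the threshold still appears once, with matched_name set to NULL.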
answered 2017-04-28T11:36:48.430

Using some made-up test data inline (you would use your own TABLE_A and TABLE_B in place of the first two with clauses, and start from with matches as ...):

with table_a (state, county_name) as
     ( select 'A', 'ST JOHNS' from dual union all
       select 'A', 'BARRY' from dual union all
       select 'B', 'CHEESECAKE' from dual union all
       select 'B', 'WAFFLES' from dual union all
       select 'C', 'UMBRELLAS' from dual )
   , table_b (state, county_name) as
     ( select 'A', 'SAINT JOHNS' from dual union all
       select 'A', 'SAINT JOANS' from dual union all
       select 'A', 'BARRY' from dual union all
       select 'A', 'BARRIERS' from dual union all
       select 'A', 'BANANA' from dual union all
       select 'A', 'BANOFFEE' from dual union all
       select 'B', 'CHEESE' from dual union all
       select 'B', 'CHIPS' from dual union all
       select 'B', 'CHICKENS' from dual union all
       select 'B', 'WAFFLING' from dual union all
       select 'B', 'KITTENS' from dual union all
       select 'C', 'PUPPIES' from dual union all
       select 'C', 'UMBRIA' from dual union all
       select 'C', 'UMBRELLAS' from dual )
   , matches as
     ( select a.state, a.county_name, b.county_name as matched_name
            , utl_match.jaro_winkler_similarity(a.county_name,b.county_name) as score
       from   table_a a
              join table_b b on b.state = a.state  )
   , ranked_matches as
     ( select m.*
            , rank() over (partition by m.state, m.county_name order by m.score desc) as ranking
       from   matches m
       where  score > 50 )
select rm.state, rm.county_name, rm.matched_name, rm.score
from   ranked_matches rm
where  ranking = 1
order by 1,2;

Results:

STATE COUNTY_NAME MATCHED_NAME      SCORE
----- ----------- ------------ ----------
A     BARRY       BARRY               100
A     ST JOHNS    SAINT JOHNS          80
B     CHEESECAKE  CHEESE               92
B     WAFFLES     WAFFLING             86
C     UMBRELLAS   UMBRELLAS           100

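The idea is that matches calculates all the scores, ranked_matches assigns them a ranking within each (state, county_name), and the final query picks all the top scorers (i.e. it filters on ranking = 1).

You may still get some duplicates, because there is nothing to stop two different fuzzy matches from getting the same score. rank() gives tied rows the same ranking, so every candidate that ties for the top score passes the ranking = 1 filter.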
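
If exactly one match per county is required, one possible approach (not part of the answer's original query) is to swap rank() for row_number() and add a tie-breaker to the ordering. The sketch below assumes that picking the alphabetically first matched_name among tied candidates is acceptable. The sample rows are hypothetical and chosen so that both candidates differ from 'BARRY' only in the final letter and therefore receive the same score.

```sql
-- Hypothetical sample data: 'BARRO' and 'BARRA' tie against 'BARRY'
with table_a (state, county_name) as
     ( select 'A', 'BARRY' from dual )
   , table_b (state, county_name) as
     ( select 'A', 'BARRO' from dual union all
       select 'A', 'BARRA' from dual )
   , matches as
     ( select a.state, a.county_name, b.county_name as matched_name
            , utl_match.jaro_winkler_similarity(a.county_name, b.county_name) as score
       from   table_a a
              join table_b b on b.state = a.state )
   , ranked_matches as
     ( select m.*
            -- row_number() never repeats a value within a partition;
            -- matched_name is an arbitrary but deterministic tie-breaker
            , row_number() over (partition by m.state, m.county_name
                                 order by m.score desc, m.matched_name) as rn
       from   matches m
       where  m.score > 50 )
select rm.state, rm.county_name, rm.matched_name, rm.score
from   ranked_matches rm
where  rm.rn = 1
order by 1, 2;
```

With this data the query should return a single row for BARRY, matched to BARRA. Whether an alphabetical tie-breaker is meaningful depends on the data; keeping rank() and reviewing the tied rows by hand may be the safer choice when ties are rare.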

answered 2017-04-28T11:57:25.407