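mysql - Improving fuzzy duplicate detection with MySQL

Question

I have a table of names, companies, or products that contains duplicate records caused by operator input errors.

I'm trying to build a tool to manage this. It won't be a high-traffic page, but it still shouldn't kill the database while building the record set. I have this query, which takes several minutes to run (far too long):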

SELECT 
    tab1.id as id1, 
    tab1.creative as creative1, 
    tab2.id as id2, 
    tab2.creative as creative2 
FROM 
    creatives tab1, 
    creatives tab2 
WHERE 
    SOUNDEX(tab1.creative)= SOUNDEX(tab2.creative) 
AND 
    tab1.id<>tab2.id 
AND 
    tab1.id=(
        SELECT 
            MAX(id) 
        FROM 
            creatives tab 
        WHERE 
            SOUNDEX(tab.creative)=SOUNDEX(tab1.creative))

Besides taking too long to return results, the results are sometimes a bit too fuzzy. For example, it's great to find these:

Convenery of Trades of Edinburgh | Convenery of Trades of Edinbubrgh
Crowdedlogic Theatre Company | Crowded Logic Theatre Company

But these seem far off:

Daniel Cope | Dan Willis & Obie
David Williams | David Holmes

Is there a faster, less fuzzy way to do this?

score 7 · Accepted Answer

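Your question has two parts.

First, why is your query slow?

Second, why does SOUNDEX() give so many false-positive matches? Is there a better way to come up with near matches than SOUNDEX()?

Let's take them one at a time.

First, let's try to speed up the query. Let's start by recasting it in standard SQL (getting rid of the old-style comma join).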

SELECT 
    tab1.id as id1, 
    tab1.creative as creative1, 
    tab2.id as id2, 
    tab2.creative as creative2 
FROM creatives AS tab1 
JOIN creatives tab2 
  ON (
          tab1.id < tab2.id   /* don't duplicate pairs a/b b/a */  
      AND SOUNDEX(tab1.creative)= SOUNDEX(tab2.creative) 
     )

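Let's ignore the last clause of your query for now.

As you can see, if your table has n rows, this evaluates the SOUNDEX function on the order of n² times.

I suggest you add a new column to your table. Make it a text string. Call it compare_hash.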
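A minimal sketch of adding that column. The VARCHAR(255) type and length are assumptions, not something the question specifies. MySQL's SOUNDEX() returns a string whose length grows with its input, so the column should be wide enough for your longest value.

```sql
-- Assumption: 255 characters is wide enough for the longest hash value.
ALTER TABLE creatives
  ADD COLUMN compare_hash VARCHAR(255) NULL;
```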

Then populate it like this:

UPDATE creatives 
   SET compare_hash = SOUNDEX(creative)

Then index it.
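Creating the index could look like this. The index name idx_creatives_compare_hash is a made-up name for illustration. On an older row format with a short index key limit, you may need a narrower column or a prefix index instead.

```sql
-- Hypothetical index name; any unused name works.
CREATE INDEX idx_creatives_compare_hash
    ON creatives (compare_hash);
```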

Then run this query:

SELECT 
    tab1.id as id1, 
    tab1.creative as creative1, 
    tab2.id as id2, 
    tab2.creative as creative2 
FROM creatives AS tab1 
JOIN creatives tab2 
  ON (
          tab1.id < tab2.id   /* don't duplicate pairs a/b b/a */  
      AND tab1.compare_hash = tab2.compare_hash
     )

This should be a lot faster, because it can use the index.
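Two follow-up queries may help here, both sketches that assume the compare_hash column and an index on it already exist. The first asks MySQL for the execution plan so you can check whether the join actually uses the index. The second summarizes each group of rows sharing a hash, which also recovers the MAX(id) per group that the last clause of the original query was after.

```sql
-- Check the plan: look for the compare_hash index in the "key" column.
EXPLAIN
SELECT tab1.id, tab2.id
  FROM creatives AS tab1
  JOIN creatives AS tab2
    ON tab1.id < tab2.id
   AND tab1.compare_hash = tab2.compare_hash;

-- One row per hash value shared by more than one record, largest groups first.
SELECT compare_hash,
       COUNT(*) AS row_count,
       MIN(id)  AS lowest_id,
       MAX(id)  AS highest_id
  FROM creatives
 WHERE compare_hash IS NOT NULL
 GROUP BY compare_hash
HAVING COUNT(*) > 1
 ORDER BY row_count DESC;
```

A hash with a very large row_count is a hint that the hash function is too coarse for your data.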

On to your second problem. Look, here's the deal: SOUNDEX() is designed to get lots of false positives. It's also designed for American English names. It's an old-timey Bell System telephone company information operator function designed to show multiple names when people ask for Bessie Schmidt when they want Bessie Smith.

You need to cook up some other comparison-hash functions and experiment with them. The cool thing about your extra table column is that you can swap in a new hash function just by repopulating it. This example converts all your strings to lower case, then pulls out the spaces, then each of the vowels. So the hash for "David Williams" will be "dvdwllms", which is different from "dvdhlms", the hash for "David Holmes".

UPDATE creatives  SET compare_hash = LOWER(creative);
UPDATE creatives  SET compare_hash = REPLACE(compare_hash , ' ', '');
UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'a', '');
UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'e', '');
UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'i', '');
UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'o', '');
UPDATE creatives  SET compare_hash = REPLACE(compare_hash , 'u', '');
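If you'd rather make one pass over the table instead of seven, the same steps can be nested into a single statement. This is a sketch of the same hash, not a different technique. LOWER() has to be the innermost call so that the REPLACE() calls, which match case-sensitively, also catch upper-case vowels.

```sql
-- Same result as the seven statements above, in one UPDATE:
-- lower-case, strip spaces, then strip a, e, i, o, u.
UPDATE creatives
   SET compare_hash =
       REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(REPLACE(
           LOWER(creative),
           ' ', ''), 'a', ''), 'e', ''), 'i', ''), 'o', ''), 'u', '');
```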

Once you've made this compare_hash, you can run the same self join query.

(I've tried Levenshtein distance for this kind of thing. The problem there is getting a consistent sameness metric for pairs of strings of different lengths.)

It's going to take some messing around and some elbow grease to get this done. How much effort you put into programming it depends, well, on http://xkcd.com/1205/
