sql - How to identify duplicate records in SQL using client name and address when both are free text

Question

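I have a database containing millions of client contacts. Many of them are duplicates, though. Could some of the heroes here suggest how to identify these duplicates using Oracle SQL, PL/SQL, or Excel?

Here is the data structure: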

Client_Header

id integer (Primary Key)
Client_First_Name (varchar2)
Client_Last_Name (varchar2)
Client_Date_Of_Birth (timestamp)

Client_Address

Client_Id (Foreign Key ref Client_header)
Address_Line1 (varchar2)
Address_Line2 (varchar2)
Adderss_Line3 (varchar2)
Suburb (Varchar2)
State (varchar2)
Country (varchar2)

My challenge is that, apart from Client_Date_Of_Birth, all of these key fields are just free text.

For example, we have a client like the one below:

Surname : Jones

First name : David

Client_Date_Of_Birth: 10/05/1975

Address: Unit 10 Floor 1, 20 Railway Parade, St Peter,  NSW 2044

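However, since these fields are free text, I have a lot of data issues. The link below (a JPEG image only) illustrates some of them:

Examples of data issues

Notes:

Besides these issues, sometimes the client's first name or last name may also be missing (but never both).
Sometimes multiple issues can be found in the same record.
Sometimes the address may be just the name of a school, shopping centre, etc.
The system does not store any other id that can uniquely identify a client.

I know that collecting all the duplicate records is nearly impossible where the client address is a school or shopping centre. For the other cases, though, is there any way to identify most of the duplicates?

Thanks for your help!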

score 1 · Accepted Answer

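Not a pretty sight, and I'm afraid I have no good news for you.

This is a common problem in databases, especially when data entry personnel are poorly trained. One of the main goals of data entry training is to make this problem well understood and to show ways of avoiding it. Something to keep in mind for the future.

Unfortunately, there is no "magic wand" that will clean up your data for you. I'm sorry to say that you are facing one of the most tedious tasks in database maintenance. You will have to remove the duplicates manually, and the job calls for an editor more than a database administrator.

If you have millions of records, of which perhaps a million are actually duplicates, I estimate it would take one expert working full time at least two years (possibly longer) to clean up your problem. With weekends off and two weeks of vacation there are about 250 working days a year, so clearing a million duplicates in two years means fixing 2,000 records every day (500 days × 2,000 records = 1,000,000).

In the end, the only reliable way to remove all the duplicates is to compare them all and delete them one at a time. But there are plenty of tricks you can use to get rid of them in chunks. Here are some that come to mind from your data sample: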

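Change "Dave" to "David" in the first name and last name fields. (Make sure nobody actually has the last name "Dave".)
Change all instances of "Jones David" to "David Jones". (Make sure there is nobody actually named "Jones David".)
Change "1/F" to "Floor 1".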
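
Rules like the nickname and floor rewrites above could be applied in bulk before any comparison. The following is a minimal Python sketch, assuming the rows have been exported from Oracle first. The rule tables `NICKNAMES` and `ADDRESS_PATTERNS` and both function names are hypothetical, and each rule should be checked against real data before it is trusted.

```python
import re

# Hypothetical rule tables. Extend them from your own data, after checking
# that no genuine name or address would be damaged by a rule.
NICKNAMES = {"dave": "David"}
ADDRESS_PATTERNS = [
    # "1/F" -> "Floor 1" (any floor number)
    (re.compile(r"\b(\d+)\s*/\s*F\b", re.IGNORECASE), r"Floor \1"),
]


def normalize_first_name(name):
    """Map a known nickname to its full form; leave other names untouched."""
    cleaned = " ".join(name.split())  # trim and collapse whitespace
    return NICKNAMES.get(cleaned.lower(), cleaned)


def normalize_address(line):
    """Collapse whitespace and rewrite known abbreviations."""
    cleaned = " ".join(line.split())
    for pattern, replacement in ADDRESS_PATTERNS:
        cleaned = pattern.sub(replacement, cleaned)
    return cleaned
```

For instance, `normalize_address("Unit 10  1/F, 20 Railway Parade")` gives `"Unit 10 Floor 1, 20 Railway Parade"`, which matches the spelling used in the question's sample address.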

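The idea is to focus on particular fields and make all the duplicates exact duplicates in those fields. Once that is done, delete all the records that have the target values in those fields, except the one with the primary key of the record you want to keep. (If your table has no key, you will have to find another way, such as selecting the top record into a new table.)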
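
Once the target fields match exactly, choosing what to delete is mechanical. Here is a rough Python sketch of that step, assuming each row is a dict with the hypothetical keys `id`, `first`, `last`, `dob` and `address`, and assuming the record to keep in each group is the one with the lowest primary key (a choice made only for this illustration).

```python
from collections import defaultdict


def ids_to_delete(records, fields=("first", "last", "dob", "address")):
    """Return the ids of exact duplicates, keeping the lowest id per group."""
    groups = defaultdict(list)
    for rec in records:
        # Records are duplicates only if every target field is identical.
        key = tuple(rec[f] for f in fields)
        groups[key].append(rec["id"])

    doomed = []
    for ids in groups.values():
        keeper = min(ids)  # assumption: the oldest key is the one to keep
        doomed.extend(i for i in ids if i != keeper)
    return sorted(doomed)
```

The returned list can then drive a delete on the primary key, ideally after a human has reviewed it.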

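This technique speeds things up when you have a large number of duplicate records. If you have only a few duplicates, it is quicker to identify them one by one. A fast way to do that is to open the table in edit mode, pick a particular field (the postcode field, in this case), and put a unique value in it (perhaps a zero, in this case) whenever you want to mark a record for deletion. You can then periodically delete all the records that have that value in the field.

You will also need to sort the data in several ways to find the duplicates, as you probably already know.

As for your notes, don't try to work out all the ways the data can be messed up. Once you have identified a record as a duplicate of another, you don't care what is wrong with it; you just need to get rid of it. If you have two records and each contains data you want to keep that the other lacks, you will have to merge them and delete one. Then go on to the next, and the next, and the next...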
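
The merge case can be sketched as follows, assuming each record is a Python dict and that "missing" means `None` or an empty string. The function name and the `key_field` parameter are hypothetical. Where both records hold different non-empty values, this sketch keeps the keeper's value; a real clean-up would probably send such conflicts to a person.

```python
def merge_records(keeper, duplicate, key_field="id"):
    """Fill the keeper's empty fields from the duplicate; never touch the key."""
    merged = dict(keeper)
    for field, value in duplicate.items():
        if field == key_field:
            continue
        keeper_is_empty = merged.get(field) in (None, "")
        if keeper_is_empty and value not in (None, ""):
            merged[field] = value
    return merged
```

After the merged row has been written back, the duplicate can be deleted like any other.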

score 0 · Accepted Answer

Some years ago I had a similar task, and it took about a year to clean the data. What I did, in short:

send the address to api.addressdoctor.com for validation and to split it into single fields (this is also possible with maps.googleapis.com)
use a first name and last name match list to check the names (we used namepedia.org). A lot depends on the quality of this list. The list should be based on the country of birth or of the first address. From the results we derived a probability for what kind of name it is (first/last/company).
with this improved data you should create some normalized and fuzzy attributes: normalized fields from the names and address, e.g. upper case and alphanumeric characters only
at the end I would change the data model a little to improve data quality by design. I recommend adding pre-title, post-title, middle-name and post-name fields. You should also add the split address fields such as street, streetno, zip, location, longitude, latitude, etc. I would also change the relation between Client_Header and Client_Address by adding an extra address_Id as primary key, but this depends on the requirements. And finally I would add some constraints to prevent duplicate entries.
after all that, the deduplication is not hard. Just group all the normalized or fuzzy data together and create a dense_rank (I group by person, household, ...). Rank the records over their attributes (I used data quality, data fill rate and transaction history for a score value). Finally, it is your choice whether to delete the duplicates and copy the corresponding data to the surviving client, or to connect the data virtually via Client_Id in an extra field.
for insert and update processes you should create PL/SQL functions that check whether a fuzzy last name (or, e.g., first name) + fuzzy address already exists. Split the name and address fields, check them with the address APIs and match them against the names reference. If it is a single-tuple data entry, show the best results to the user and let them decide.
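
The normalize, group and rank steps above can be sketched outside the database as well. This Python sketch is only an illustration under several assumptions: rows are dicts with the hypothetical keys `id`, `first`, `last`, `dob`, `address` and `suburb`; the score uses fill rate alone (the answer also mentions data quality and transaction history); and the fuzzy check uses `difflib.SequenceMatcher` from the standard library with an arbitrary threshold, which is a stand-in and not the matching the answer's author used.

```python
import re
from difflib import SequenceMatcher


def normalized(text):
    """Upper-case the text and keep only letters and digits."""
    return re.sub(r"[^A-Z0-9]", "", (text or "").upper())


def match_key(rec):
    """Normalized attributes that must agree for two rows to be grouped."""
    return (normalized(rec["last"]), normalized(rec["first"]),
            rec["dob"] or "", normalized(rec["address"]))


def dense_rank(records):
    """Map id -> group number; equal keys share a number, with no gaps."""
    keys = sorted({match_key(r) for r in records})
    rank_of = {key: n for n, key in enumerate(keys, start=1)}
    return {r["id"]: rank_of[match_key(r)] for r in records}


def fill_rate(rec):
    """Fraction of non-key fields that are populated."""
    values = [v for k, v in rec.items() if k != "id"]
    return sum(1 for v in values if v not in (None, "")) / len(values)


def survivors(records):
    """Per group, pick the best-filled record (ties go to the lowest id)."""
    ranks = dense_rank(records)
    best = {}
    for rec in records:
        group = ranks[rec["id"]]
        current = best.get(group)
        if current is None or (fill_rate(rec), -rec["id"]) > (
                fill_rate(current), -current["id"]):
            best[group] = rec
    return {group: rec["id"] for group, rec in best.items()}


def similar(a, b, threshold=0.85):
    """Fuzzy comparison of two normalized strings (threshold is arbitrary)."""
    ratio = SequenceMatcher(None, normalized(a), normalized(b)).ratio()
    return ratio >= threshold
```

A check like `similar` is the kind of test an insert or update routine could run against existing rows before accepting a new client, showing the closest candidates to the user when it fires.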
