sql - Finding duplicate records in SQL based on a combination of fields

Question

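I have a project where an important part is identifying duplicate records in a database (SQL Server 2005). I know the obvious ways to find duplicate records. In this case, however, we want to be fairly smart about the process. The table will contain information about leads (potential customers). An initial table will accept all leads. We will then run a de-duplication process that checks whether a lead is a duplicate by matching on several fields. For example, we might want to match on last name, first name, email, and zip code. That is only an example, but essentially we want to build a key from various fields to tell whether the person already exists. Records that are not dupes will go into the final table.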

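I would like to use SSIS for this, but I am not sure of the best way to accomplish it with SSIS. Can anyone point me in the right direction, or provide a link to an example that uses SSIS to handle dupes by checking a combination of fields?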

score 2 · Accepted Answer

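It seems to me that you are trying to outsmart your users by second-guessing them. Unfortunately, that will almost never work, because you may genuinely have two customers with the same name but different zip codes, or other similar cases.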

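Your best bet is to "suggest" to users that the customer they are about to save already exists (and show them the duplicate), but let them save anyway. So the process might need to look like this: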

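1. The user enters the information and presses Save.
2. The system uses the matching criteria to detect potential duplicates and prompts the user.
3. The user either cancels or confirms, and you take the appropriate action.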

If there are no potential duplicates, steps 2-3 can safely be skipped.
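
The detection step in that workflow could be sketched as a lookup that runs before the save. This is only a minimal sketch: the table name tblMain, the column names, the variable types, and the sample values are all assumptions made for illustration, and the match criteria are the four fields mentioned in the question.

```sql
-- Hypothetical incoming values; in practice these would be stored procedure parameters.
-- SQL Server 2005 does not allow DECLARE with an initializer, hence the separate SETs.
DECLARE @lastname  varchar(50);
DECLARE @firstname varchar(50);
DECLARE @email     varchar(100);
DECLARE @zipcode   varchar(10);

SET @lastname  = 'Smith';
SET @firstname = 'Jane';
SET @email     = 'jane.smith@example.com';
SET @zipcode   = '12345';

-- Return existing rows that match the incoming lead on every key field.
-- ISNULL makes a NULL on one side compare equal to a NULL (or empty string) on the other.
SELECT m.ID, m.lastname, m.firstname, m.email, m.zipcode
FROM tblMain m
WHERE ISNULL(m.lastname, '')  = ISNULL(@lastname, '')
  AND ISNULL(m.firstname, '') = ISNULL(@firstname, '')
  AND ISNULL(m.email, '')     = ISNULL(@email, '')
  AND ISNULL(m.zipcode, '')   = ISNULL(@zipcode, '');
```

If the query returns no rows, the record can be saved directly; otherwise the returned rows are the candidates to show the user before they confirm or cancel.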

score 1 · Accepted Answer

-- List all Duplicates
select m1.lastname, m1.firstname, m1.email, m1.zipcode
from tblMain m1
inner join tblMain m2
on isnull(m1.lastname, '') = isnull(m2.lastname, '')
and isnull(m1.firstname, '') = isnull(m2.firstname, '')
and isnull(m1.email, '') = isnull(m2.email, '')
and isnull(m1.zipcode, '') = isnull(m2.zipcode, '')
and m1.ID <> m2.ID
order by 1, 2, 3, 4

To delete the newer duplicates (keeping the row with the lowest ID in each group), use the following:

delete from tblMain
where ID in 
(
    select m1.ID
    from tblMain m1
    inner join tblMain m2
    on isnull(m1.lastname, '') = isnull(m2.lastname, '')
    and isnull(m1.firstname, '') = isnull(m2.firstname, '')
    and isnull(m1.email, '') = isnull(m2.email, '')
    and isnull(m1.zipcode, '') = isnull(m2.zipcode, '')
    and m1.ID > m2.ID
)
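
As a possible alternative to the self-join, SQL Server 2005 also supports ROW_NUMBER() and deleting through a common table expression. The sketch below assumes the same tblMain table and columns; the CTE name ranked and the alias rn are made up for illustration. One difference to be aware of: PARTITION BY groups NULLs together but, unlike the isnull(..., '') comparison, it treats NULL and an empty string as different values.

```sql
-- Number the rows within each group of identical key fields, lowest ID first.
-- The statement before WITH must be terminated with a semicolon.
WITH ranked AS (
    SELECT ID,
           ROW_NUMBER() OVER (
               PARTITION BY lastname, firstname, email, zipcode
               ORDER BY ID
           ) AS rn
    FROM tblMain
)
-- Row 1 of each group is kept; every later row is removed from tblMain.
DELETE FROM ranked
WHERE rn > 1;
```

Replacing the DELETE with a SELECT * FROM ranked WHERE rn > 1 first is a cheap way to preview which rows would be removed.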

score 0 · Accepted Answer

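I don't see how you can be sure that SSIS is the answer to your problem. Why can't you simply create a unique key on the "final" table to ensure that no duplicates are added? Perhaps you should explain your problem better...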
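
A unique key of that kind might look roughly like the following. This is a sketch under assumptions: tblFinal and the index name UX_tblFinal_LeadKey are hypothetical, tblMain stands in for the initial table, and tblFinal is assumed to have no other required columns (for example, an identity ID). SQL Server limits index keys to 900 bytes, so very wide columns may not fit in a single key.

```sql
-- Hypothetical final table and index name; the key columns follow the question's example.
CREATE UNIQUE INDEX UX_tblFinal_LeadKey
    ON tblFinal (lastname, firstname, email, zipcode);

-- Copy only the leads whose key is not already present in the final table.
-- DISTINCT collapses exact repeats within the source so the insert does not
-- violate the unique index on its own rows.
INSERT INTO tblFinal (lastname, firstname, email, zipcode)
SELECT s.lastname, s.firstname, s.email, s.zipcode
FROM (
    SELECT DISTINCT lastname, firstname, email, zipcode
    FROM tblMain
) s
WHERE NOT EXISTS (
    SELECT 1
    FROM tblFinal f
    WHERE ISNULL(f.lastname, '')  = ISNULL(s.lastname, '')
      AND ISNULL(f.firstname, '') = ISNULL(s.firstname, '')
      AND ISNULL(f.email, '')     = ISNULL(s.email, '')
      AND ISNULL(f.zipcode, '')   = ISNULL(s.zipcode, '')
);
```

The index only rejects exact matches on all four columns, so it does not address near-duplicates such as misspelled names.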
