sql - Listing duplicated records using T-SQL

Question

I have a database that is used to record patient information for a small clinic. We use MS SQL Server 2008 as the backend. The patient table contains the following columns:

Id int identity(1,1), 
FamilyName varchar(30),
FirstName varchar (20), 
DOB datetime, 
AddressLine1 varchar (50), 
AddressLine2 varchar (50), 
State varchar (20), 
Postcode varchar (4), 
NextOfKin varchar (20), 
Homephone varchar (20), 
Mobile varchar (20)

Occasionally the staff register a new patient, unaware that the patient already has a record in the system. We have ended up with several thousand duplicate records.

What I would like to do is present a list of patients who have duplicate records, so that the staff can merge them during quiet times. We consider two records to be duplicates if they have exactly the same FamilyName, FirstName and DOB. At the moment I use a subquery to return the records, as follows:

SELECT FamilyName, 
       FirstName, 
       DOB, 
       AddressLine1, 
       AddressLine2, 
       State, 
       Postcode, 
       NextOfKin, 
       HomePhone,
       Mobile 
FROM
Patients AS p1 
WHERE Id IN 
          ( 
            SELECT MAX(Id) 
            FROM Patients AS p2 
            GROUP BY    
            FamilyName, 
            FirstName, 
            DOB 
            HAVING COUNT(Id) > 1
          )

This produces the result, but the performance is terrible. Is there a better way to do it? The only requirement is that I need to show all the fields in the Patients table, because the users of the system want to view all the details before deciding whether to merge the records.

score 0 · Accepted Answer

WITH CTE 
AS
(
SELECT Id, FamilyName, FirstName, DOB,
ROW_NUMBER() OVER(PARTITION BY FamilyName, FirstName, DOB ORDER BY Id) AS DuplicateCount
FROM Patients
)
SELECT * FROM CTE WHERE DuplicateCount > 1
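
Note that a ROW_NUMBER() filter of this kind returns only the second and later rows of each group, and only the columns selected inside the CTE. If every row of a duplicate group should be listed with all of its columns, one possible variant is to count the rows per group with a windowed COUNT(*). This is only a sketch: it assumes the table is named Patients, as in the question, and the CTE name Counted and the column alias NumberOfDuplicates are made up for illustration.

```sql
-- Sketch: list every row that belongs to a duplicate group, with all columns.
-- Assumes the table is named Patients, as in the question.
WITH Counted AS
(
    SELECT p.*,
           -- number of rows sharing this FamilyName/FirstName/DOB combination
           COUNT(*) OVER (PARTITION BY FamilyName, FirstName, DOB) AS NumberOfDuplicates
    FROM Patients AS p
)
SELECT *
FROM Counted
WHERE NumberOfDuplicates > 1
ORDER BY FamilyName, FirstName, DOB, Id;  -- keeps the rows of each group together
```

Sorting by the three key columns puts the records to be compared next to each other, which may make the manual merge easier.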

score 0 · Accepted Answer

 select FamilyName, FirstName, DOB 
   from Patients 
  group by FamilyName, FirstName, DOB 
 having count(*)>1

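This will show every FamilyName, FirstName and DOB combination that occurs more than once.

However, consider names that are spelled differently but are similar. You may want to look up the topics "deduplication" and/or "record linkage". I solved this problem using string similarity algorithms (a modified Jaro/Winkler and Levenshtein).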
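
SQL Server has no built-in Jaro/Winkler or Levenshtein function, so the sketch below swaps in a different, much cruder technique: the built-in DIFFERENCE() function, which compares the SOUNDEX codes of two strings. It is meant only to illustrate the idea of fuzzy matching. Requiring an identical DOB and a DIFFERENCE() score of 4 are assumptions chosen for this example, not something the answer prescribes.

```sql
-- Sketch: pair up records with the same DOB whose names sound alike.
-- DIFFERENCE() compares SOUNDEX codes and returns 0 (least similar) to 4 (most similar).
SELECT p1.Id AS Id1, p1.FamilyName AS FamilyName1, p1.FirstName AS FirstName1,
       p2.Id AS Id2, p2.FamilyName AS FamilyName2, p2.FirstName AS FirstName2,
       p1.DOB
FROM Patients AS p1
    INNER JOIN Patients AS p2
    ON p1.DOB = p2.DOB          -- assumption: only compare patients born on the same day
       AND p1.Id < p2.Id        -- report each pair once
WHERE DIFFERENCE(p1.FamilyName, p2.FamilyName) = 4
  AND DIFFERENCE(p1.FirstName, p2.FirstName) = 4;
```

Exact duplicates also appear in this result, since identical names always score 4. SOUNDEX is tuned for English surnames, so expect both false positives and misses.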

score 0 · Accepted Answer

I suggest you build an index on the 3 fields used to detect duplicates, then try the following query:

with Duplicates as
( 
    select FamilyName, FirstName, DOB
    from Patients 
    group by FamilyName, FirstName, DOB
    having count(*) > 1 
)
Select Patients.* 
from Patients 
    inner join Duplicates
    on Patients.FamilyName = Duplicates.FamilyName 
       And Patients.FirstName= Duplicates.FirstName
       and Patients.DOB= Duplicates.DOB
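
The index suggested above might look like the following. The index name is hypothetical; the key columns are the three fields the question uses to define a duplicate.

```sql
-- Hypothetical index name; key columns are the three duplicate-detection fields.
CREATE NONCLUSTERED INDEX IX_Patients_FamilyName_FirstName_DOB
    ON Patients (FamilyName, FirstName, DOB);
```

With the key columns in this order, both the GROUP BY in the CTE and the join back to Patients can use the index. Whether the optimizer actually does so depends on the data, so it is worth checking the execution plan.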

score 0 · Accepted Answer

If I were in your shoes, I would do the following:

Add an index on FamilyName, FirstName and DOB
Create a view for your subquery
Modify the query as follows

Select p.* FROM Patients p INNER JOIN view_name v ON v.FirstName=p.Firstname AND ...
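
As a rough sketch of the view step above, the grouping subquery could be wrapped in a view and joined back to the table. The view name DuplicatePatientKeys and the NumberOfDuplicates column are assumptions made for illustration, and the join conditions shown are the three columns the question uses to define a duplicate.

```sql
-- Hypothetical view name; wraps the grouping subquery from the question.
CREATE VIEW DuplicatePatientKeys
AS
SELECT FamilyName, FirstName, DOB, COUNT(*) AS NumberOfDuplicates
FROM Patients
GROUP BY FamilyName, FirstName, DOB
HAVING COUNT(*) > 1;
GO

-- Join back to the table to get every column of every duplicated record.
SELECT p.*
FROM Patients AS p
    INNER JOIN DuplicatePatientKeys AS v
    ON v.FamilyName = p.FamilyName
       AND v.FirstName = p.FirstName
       AND v.DOB = p.DOB;
```

GO is the batch separator used by SSMS and sqlcmd; it is needed here because CREATE VIEW must be the first statement in its batch.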

score 0 · Accepted Answer

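This will output every row that has a duplicate, based on first name and family name:

SELECT DISTINCT t1.* 
FROM Patients AS t1 
    INNER JOIN Patients AS t2
    ON t1.FirstName = t2.FirstName 
       AND t1.FamilyName = t2.FamilyName
       -- add AND t1.DOB = t2.DOB to match the question's definition of a duplicate
       AND t1.Id <> t2.Id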
