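I have a user dimension table that unfortunately contains a bunch of duplicate records. See the screenshot.
I have thousands of users and 5 tables that reference the duplicates. I want to delete the records with the "bad" UserIDs.
To do that, I want to go through the 5 dependent tables and replace each bad UserID
with the "good" one (circled in red).
What is a good approach for this?
Here is what I did to get the screenshot above: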
SELECT userIds.UserID
      ,userIds.FirstName
      ,userIds.LastName
      ,dupTable.Email
      ,dupTable.Username
      ,dupTable.DupCount
FROM dbo.DimUsers AS userIds
LEFT OUTER JOIN
    (SELECT FirstName
           ,LastName
           ,Email
           ,UserName
           ,DupCount
     FROM
         (SELECT FirstName
                ,LastName
                ,UserName
                ,Email
                ,COUNT(*) AS DupCount -- we're finding duplications by matches on FirstName,
                                      -- last name, UserName AND Email. All four fields must match
                                      -- to find a dupe. More confidence from this.
          FROM dbo.DimUsers
          GROUP BY FirstName
                  ,LastName
                  ,UserName
                  ,Email
          HAVING COUNT(*) > 1) AS userTable -- any count more than 1 is a dupe
     WHERE LastName NOT LIKE 'NULL' -- exclude entries with literally NULL names
       AND FirstName NOT LIKE 'NULL'
    ) AS dupTable
    ON dupTable.FirstName = userIds.FirstName -- to get the userIds of dupes, we LEFT JOIN the original table
   AND dupTable.LastName = userIds.LastName   -- on four fields to increase our confidence
   AND dupTable.Email = userIds.Email
   AND dupTable.Username = userIds.Username
WHERE dupTable.DupCount IS NOT NULL -- ignore NULL dupcounts, these are not dupes
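
For reference, the kind of remap I have in mind could be sketched roughly as follows. This is only a sketch under assumptions that are not settled above: the lowest UserID in each duplicate group is treated as the "good" one (the rule behind the red circles may differ), dbo.FactOrders is a hypothetical stand-in for one of the 5 dependent tables, and #UserIdMap is a made-up temp table name. Rows with a NULL in any of the four matching columns are left alone, which matches what the equality joins in the query above already do.

```sql
BEGIN TRANSACTION;

-- Step 1: build a bad -> good UserID map.
-- ASSUMPTION: the lowest UserID in each duplicate group is the one to keep.
WITH Ranked AS (
    SELECT UserID
          ,MIN(UserID) OVER (PARTITION BY FirstName, LastName, UserName, Email) AS GoodUserID
    FROM dbo.DimUsers
    WHERE FirstName NOT LIKE 'NULL'   -- same exclusions as the duplicate-finding query
      AND LastName NOT LIKE 'NULL'
      AND UserName IS NOT NULL        -- keep NULLs out of the match, as the equality joins do
      AND Email IS NOT NULL
)
SELECT UserID AS BadUserID
      ,GoodUserID
INTO #UserIdMap                       -- hypothetical temp table name
FROM Ranked
WHERE UserID <> GoodUserID;           -- only rows that actually need remapping

-- Step 2: repoint one dependent table at the surviving UserID.
-- dbo.FactOrders is a HYPOTHETICAL name; repeat this for each of the 5 tables.
UPDATE f
SET UserID = m.GoodUserID
FROM dbo.FactOrders AS f
INNER JOIN #UserIdMap AS m
    ON m.BadUserID = f.UserID;

-- Step 3: once nothing references the bad UserIDs, remove them from the dimension.
DELETE u
FROM dbo.DimUsers AS u
INNER JOIN #UserIdMap AS m
    ON m.BadUserID = u.UserID;

COMMIT TRANSACTION;

DROP TABLE #UserIdMap;
```

Before running the UPDATE and DELETE, it is probably worth selecting from #UserIdMap and spot-checking a few groups against the screenshot. If a dependent table has a unique constraint that includes UserID, the UPDATE could collide and would need extra handling that this sketch does not cover.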