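I have a user dimension table that unfortunately contains a bunch of duplicate records. See the screenshot (duplicate records).

I have thousands of users and 5 tables that reference the duplicates. I want to delete the records with the "bad" UserIDs. To do that, I want to go through the 5 dependent tables and replace each bad UserID with the "good" one (circled in red).

What would be a good approach for this?

Here is what I did to produce the screenshot above: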

SELECT UserID
    ,userIds.FirstName
    ,userIds.LastName
    ,dupTable.Email
    ,dupTable.Username
    ,dupTable.DupCount
FROM dbo.DimUsers AS userIds
LEFT OUTER JOIN
    (SELECT FirstName
        ,LastName
        ,Email
        ,UserName
        ,DupCount
    FROM
        (SELECT FirstName
            ,LastName
            ,UserName
            ,Email
            ,COUNT(*) AS DupCount -- we're finding duplications by matches on FirstName,
                                    -- last name, UserName AND Email.  All four fields must match
                                    -- to find a dupe.  More confidence from this.
        FROM dbo.DimUsers
        GROUP BY FirstName
            ,LastName
            ,UserName
            ,Email
        HAVING COUNT(*) > 1) AS userTable -- any count more than 1 is a dupe
        WHERE LastName NOT LIKE 'NULL' -- exclude entries with literally NULL names
            AND FirstName NOT LIKE 'NULL'
        )AS dupTable
ON dupTable.FirstName = userIds.FirstName -- to get the userIds of dupes, we LEFT JOIN the original table
    AND dupTable.LastName = userIds.LastName -- on four fields to increase our confidence
    AND dupTable.Email = userIds.Email
    AND dupTable.Username = userIds.Username
WHERE DupCount IS NOT NULL -- ignore NULL dupcounts, these are not dupes

1 Answer

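This code should work. It is written for 1 dependent table, but you can use the same logic to update the other 4 tables. Note that it keeps the lowest UserID in each group of duplicates as the "good" one, and that the dependent tables must be updated before the duplicates are deleted from DimUsers.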

update t
set UserID = MinUserID.UserID
from
  DimUsersChild1 t
  inner join DimUsers on DimUsers.UserID = t.UserID
  inner join (
              select min(UserID) UserID, FirstName, LastName, UserName, Email
              from DimUsers
              group by
                FirstName, LastName, UserName, Email
              ) MinUserID on 
                          MinUserID.FirstName = DimUsers.FirstName and
                          MinUserID.LastName = DimUsers.LastName and
                          MinUserID.UserName = DimUsers.UserName and
                          MinUserID.Email = DimUsers.Email

select * from DimUsersChild1;

delete t1
from
  DimUsers t
  inner join DimUsers t1 on t1.FirstName = t.FirstName and
                            t1.LastName = t.LastName and
                            t1.UserName = t.UserName and
                            t1.Email = t.Email
where
t.UserID < t1.UserID


select * from DimUsers;

Here is a working demo.
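
If it helps, the same update-then-delete logic can be extended to several dependent tables by building the bad-to-good mapping once and reusing it. This is only a rough sketch under a few assumptions: DimUsersChild2 is a hypothetical name standing in for the other dependent tables, #UserIdMap is a temp table introduced here for illustration, and the lowest UserID in each group is treated as the "good" one, as in the code above. Everything runs in one transaction so it can be rolled back if the final check looks wrong.

```sql
-- Sketch only: #UserIdMap and DimUsersChild2 are hypothetical names.
BEGIN TRANSACTION;

-- 1. Map every duplicate ("bad") UserID to the lowest UserID in its group.
SELECT u.UserID AS BadUserID,
       k.KeepUserID
INTO #UserIdMap
FROM dbo.DimUsers AS u
INNER JOIN (
    SELECT MIN(UserID) AS KeepUserID, FirstName, LastName, UserName, Email
    FROM dbo.DimUsers
    GROUP BY FirstName, LastName, UserName, Email
) AS k
    ON  k.FirstName = u.FirstName
    AND k.LastName  = u.LastName
    AND k.UserName  = u.UserName
    AND k.Email     = u.Email
WHERE u.UserID <> k.KeepUserID;   -- only the rows that will be removed

-- 2. Repoint each dependent table (repeat this statement per table).
UPDATE c
SET c.UserID = m.KeepUserID
FROM dbo.DimUsersChild1 AS c
INNER JOIN #UserIdMap AS m ON m.BadUserID = c.UserID;

UPDATE c
SET c.UserID = m.KeepUserID
FROM dbo.DimUsersChild2 AS c      -- hypothetical second dependent table
INNER JOIN #UserIdMap AS m ON m.BadUserID = c.UserID;

-- 3. Delete the duplicates, which are no longer referenced.
DELETE u
FROM dbo.DimUsers AS u
INNER JOIN #UserIdMap AS m ON m.BadUserID = u.UserID;

-- 4. Check for remaining duplicates. Rows with NULL in any of the four
--    columns are never matched by the equality joins above, so such
--    groups can still show up here.
SELECT FirstName, LastName, UserName, Email, COUNT(*) AS DupCount
FROM dbo.DimUsers
GROUP BY FirstName, LastName, UserName, Email
HAVING COUNT(*) > 1;

-- Issue ROLLBACK TRANSACTION instead if the check above looks wrong.
COMMIT TRANSACTION;

DROP TABLE #UserIdMap;
```

Keeping the mapping in one place means all 5 tables are guaranteed to be repointed to the same surviving UserID, and the delete removes exactly the rows that were remapped.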

answered 2013-06-14T03:51:19.277