sql - Duplicate email addresses with an ID column

Question

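My table contains duplicate email addresses. Each email address has a unique create date and a unique ID. I want to identify the email address with the most recent create date and its associated ID, and also show the duplicate IDs with their create dates. I would like the query to display the following columns:

Column 1: EmailAddress
Column 2: IDKeep
Column 3: CreateDateofIDKeep
Column 4: DuplicateID
Column 5: CreateDateofDuplicateID

Note: in some cases there are more than 2 duplicate email addresses. I would like the query to show each additional duplicate on a new row, restating the EmailAddress and IDKeep in those instances.

I have tried piecing together different queries found here, to no avail. I'm currently at a loss; any help/guidance would be greatly appreciated.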

score 1 · Accepted Answer

Complicated queries are best solved by breaking them up into pieces and working step by step.

First, let's write a query that finds the key of the row we want to keep: find the most recent create date for each email, then join back to the table to get the Id:

select x.Email, x.CreateDate, x.Id
from myTable x
join (
    select Email, max(CreateDate) as CreateDate
    from myTable
    group by Email
) y on x.Email = y.Email and x.CreateDate = y.CreateDate

Ok, now let's make a query to get duplicate email addresses:

select Email
from myTable
group by Email
having count(*) > 1

And join this query back to the table to get the keys for every row that has duplicates:

select x.Email, x.Id, x.CreateDate
from myTable x
join (
    select Email
    from myTable
    group by Email
    having count(*) > 1
) y on x.Email = y.Email

Great. Now all that is left is to join the first query with this one to get our result:

select keep.Email, keep.Id as IdKeep, keep.CreateDate as CreateDateOfIdKeep,
    dup.Id as DuplicateId, dup.CreateDate as CreateDateOfDuplicateId
from (
    select x.Email, x.CreateDate, x.Id
    from myTable x
    join (
        select Email, max(CreateDate) as CreateDate
        from myTable
        group by Email
    ) y on x.Email = y.Email and x.CreateDate = y.CreateDate
) keep
join (
    select x.Email, x.Id, x.CreateDate
    from myTable x
    join (
        select Email
        from myTable
        group by Email
        having count(*) > 1
    ) y on x.Email = y.Email
) dup on keep.Email = dup.Email and keep.Id <> dup.Id

Note that the final keep.Id <> dup.Id predicate on the join ensures we don't get the same row for both keep and dup.
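To check the shape of the result, here is a minimal sketch that runs the final query against a few made-up rows. It assumes SQLite (through Python's sqlite3 module) purely so the example is self-contained; the sample data, the column types, the added order by, and the names SAMPLE_ROWS and find_duplicates are illustrative and not part of the original question.

```python
import sqlite3

# Hypothetical sample data: (Id, Email, CreateDate as ISO text).
SAMPLE_ROWS = [
    (1, "a@example.com", "2020-01-01"),
    (2, "a@example.com", "2020-02-01"),
    (3, "a@example.com", "2020-03-01"),
    (4, "b@example.com", "2020-01-15"),  # not duplicated, should not appear
    (5, "c@example.com", "2020-01-10"),
    (6, "c@example.com", "2020-04-10"),
]

# The final query from the answer, plus an order by for a stable result.
QUERY = """
select keep.Email, keep.Id as IdKeep, keep.CreateDate as CreateDateOfIdKeep,
    dup.Id as DuplicateId, dup.CreateDate as CreateDateOfDuplicateId
from (
    select x.Email, x.CreateDate, x.Id
    from myTable x
    join (
        select Email, max(CreateDate) as CreateDate
        from myTable
        group by Email
    ) y on x.Email = y.Email and x.CreateDate = y.CreateDate
) keep
join (
    select x.Email, x.Id, x.CreateDate
    from myTable x
    join (
        select Email
        from myTable
        group by Email
        having count(*) > 1
    ) y on x.Email = y.Email
) dup on keep.Email = dup.Email and keep.Id <> dup.Id
order by keep.Email, dup.Id
"""


def find_duplicates(rows):
    """Load rows into an in-memory table and return the query result."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "create table myTable (Id integer primary key, Email text, CreateDate text)"
        )
        conn.executemany(
            "insert into myTable (Id, Email, CreateDate) values (?, ?, ?)", rows
        )
        return conn.execute(QUERY).fetchall()
    finally:
        conn.close()


for row in find_duplicates(SAMPLE_ROWS):
    print(row)
```

With this sample data the query returns one row per duplicate, repeating the email and the kept Id: a@example.com keeps Id 3 and lists Ids 1 and 2 on separate rows, c@example.com keeps Id 6 and lists Id 5, and b@example.com does not appear at all.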

score 0 · Accepted Answer

The following subquery uses a trick to get the most recent ID and create date for each email:

select Email, max(CreateDate) as CreateDate,
       substring_index(group_concat(id order by CreateDate desc), ',', 1) as id
from myTable
group by Email
having count(*) > 1;

The having clause also ensures that this only applies to emails that are duplicated.
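For readers unfamiliar with the MySQL functions involved, the sketch below imitates in plain Python what the subquery computes for each email; it is not MySQL itself. The function name latest_id_per_duplicate_email and the sample rows are hypothetical. One thing it makes visible is that the id comes back as a string, because it is cut out of a concatenated list.

```python
from collections import defaultdict


def latest_id_per_duplicate_email(rows):
    """rows: iterable of (id, email, create_date), dates as ISO strings.

    Returns {email: (max_create_date, id_as_string)} for duplicated emails.
    """
    by_email = defaultdict(list)
    for row_id, email, create_date in rows:
        by_email[email].append((create_date, row_id))

    result = {}
    for email, entries in by_email.items():
        if len(entries) <= 1:  # mirrors: having count(*) > 1
            continue
        entries.sort(reverse=True)  # mirrors: order by CreateDate desc
        # mirrors: group_concat(id order by CreateDate desc), e.g. "3,2,1"
        concatenated = ",".join(str(row_id) for _, row_id in entries)
        # mirrors: substring_index(..., ',', 1), i.e. text before the first comma
        first_id = concatenated.split(",", 1)[0]
        result[email] = (entries[0][0], first_id)
    return result


sample = [
    (1, "a@example.com", "2020-01-01"),
    (2, "a@example.com", "2020-02-01"),
    (3, "a@example.com", "2020-03-01"),
    (4, "b@example.com", "2020-01-15"),
]
print(latest_id_per_duplicate_email(sample))
```

For the sample rows, the concatenated list for a@example.com is "3,2,1" and the trick keeps "3"; b@example.com is skipped because it has only one row.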

Then you just need to combine this query with the rest of the data to get the format you want:

select t.Email, tkeep.id as keep_id, tkeep.CreateDate as keep_date,
       -- qualify with t: id and CreateDate exist in both t and tkeep
       t.id as dup_id, t.CreateDate as dup_CreateDate
from myTable t join
     (select Email, max(CreateDate) as CreateDate,
             substring_index(group_concat(id order by CreateDate desc), ',', 1) as id
      from myTable
      group by Email
      having count(*) > 1
     ) tkeep
     on t.Email = tkeep.Email and t.CreateDate <> tkeep.CreateDate;
