The query for building a list of candidate duplicates is simple enough:

SELECT Count(*), Can_FName, Can_HPhone, Can_EMail
FROM Can 
GROUP BY Can_FName, Can_HPhone, Can_EMail
HAVING Count(*) > 1

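But if the actual rule I want to check is FName AND (HPhone OR Email), how can I adjust the GROUP BY to work with this?

I'm fairly sure I'll end up with a UNION SELECT here (i.e. group on FName, HPhone in one query and on FName, EMail in the other, then combine the results), but I'd love to know if anyone knows an easier way to do it.

Thank you in advance for any help.

Scott in Maine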

7 Answers

Before I can advise anything, I need to know the answer to this question:

name  phone      email

John  555-00-00  john@example.com
John  555-00-01  john@example.com
John  555-00-01  john-other@example.com

What COUNT(*) do you want for this data?

Update:

If you just want to know whether a record has any duplicates, use this:

WITH    q AS (
        SELECT  1 AS id, 'John' AS name, '555-00-00' AS phone, 'john@example.com' AS email
        UNION ALL
        SELECT  2 AS id, 'John', '555-00-01', 'john@example.com'
        UNION ALL
        SELECT  3 AS id, 'John', '555-00-01', 'john-other@example.com'
        UNION ALL
        SELECT  4 AS id, 'James', '555-00-00', 'james@example.com'
        UNION ALL
        SELECT  5 AS id, 'James', '555-00-01', 'james-other@example.com'
        )
SELECT  *
FROM    q qo
WHERE   EXISTS
        (
        SELECT  NULL
        FROM    q qi
        WHERE   qi.id <> qo.id
                AND qi.name = qo.name
                AND (qi.phone = qo.phone OR qi.email = qo.email)
        )

It's more efficient, but it doesn't tell you where the duplicate chain started.
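For anyone who wants to try the EXISTS query without a SQL Server instance, here is a minimal sketch that runs the same predicate through Python's built-in sqlite3 module. The table name q and the sample rows come from the query above; the helper name rows_with_duplicates is just a name chosen for this illustration, and SQLite stands in for SQL Server here.

```python
import sqlite3

# Sample rows from the answer above: (id, name, phone, email).
ROWS = [
    (1, "John", "555-00-00", "john@example.com"),
    (2, "John", "555-00-01", "john@example.com"),
    (3, "John", "555-00-01", "john-other@example.com"),
    (4, "James", "555-00-00", "james@example.com"),
    (5, "James", "555-00-01", "james-other@example.com"),
]

# Same predicate as the answer: same name AND (same phone OR same email).
EXISTS_QUERY = """
SELECT qo.id
FROM   q AS qo
WHERE  EXISTS (
       SELECT NULL
       FROM   q AS qi
       WHERE  qi.id <> qo.id
              AND qi.name = qo.name
              AND (qi.phone = qo.phone OR qi.email = qo.email)
       )
ORDER BY qo.id
"""


def rows_with_duplicates(rows):
    """Return the ids of rows that have at least one duplicate (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE q (id INTEGER PRIMARY KEY, name TEXT, phone TEXT, email TEXT)"
        )
        conn.executemany("INSERT INTO q VALUES (?, ?, ?, ?)", rows)
        return [row[0] for row in conn.execute(EXISTS_QUERY)]
    finally:
        conn.close()


flagged = rows_with_duplicates(ROWS)
print(flagged)  # the three John rows are flagged; the two James rows share neither phone nor email
```

Note that row 1 and row 3 are both flagged even though they only match each other through row 2, which is why this form cannot say where a chain begins.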

This query selects all entries along with a special field, chainid, that indicates where the duplicate chain started.

WITH    q AS (
        SELECT  1 AS id, 'John' AS name, '555-00-00' AS phone, 'john@example.com' AS email
        UNION ALL
        SELECT  2 AS id, 'John', '555-00-01', 'john@example.com'
        UNION ALL
        SELECT  3 AS id, 'John', '555-00-01', 'john-other@example.com'
        UNION ALL
        SELECT  4 AS id, 'James', '555-00-00', 'james@example.com'
        UNION ALL
        SELECT  5 AS id, 'James', '555-00-01', 'james-other@example.com'
        ),
        dup AS (
        SELECT  id AS chainid, id, name, phone, email, 1 as d
        FROM    q
        UNION ALL
        SELECT  chainid, qo.id, qo.name, qo.phone, qo.email, d + 1
        FROM    dup
        JOIN    q qo
        ON      qo.name = dup.name
                AND (qo.phone = dup.phone OR qo.email = dup.email)
                AND qo.id > dup.id
        ),
        chains AS 
        (
        SELECT  *
        FROM    dup do
        WHERE   chainid NOT IN
                (
                SELECT  id
                FROM    dup di
                WHERE   di.chainid < do.chainid
                )
        )
SELECT  *
FROM    chains
ORDER BY
        chainid
answered 2009-07-02T16:15:57.403

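GROUP BY doesn't support OR: grouping is implicitly AND across the listed columns, and the GROUP BY must include every non-aggregated column in the select list.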
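To make the implicit AND concrete, here is a small sketch using Python's sqlite3 module as a stand-in for SQL Server. The column names come from the question; the three sample rows are borrowed from another answer in this thread, and the run helper and the matched_on, value and n aliases are made up for the illustration. Grouping on all three columns finds nothing, while two separate GROUP BY queries combined with UNION ALL find a match on each side of the OR.

```python
import sqlite3

# Three rows that match on name and on phone OR email, but never on both.
ROWS = [
    ("John", "555-00-00", "john@example.com"),
    ("John", "555-00-01", "john@example.com"),
    ("John", "555-00-01", "john-other@example.com"),
]


def run(sql):
    """Load ROWS into an in-memory Can table and run one query (hypothetical helper)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE Can (Can_FName TEXT, Can_HPhone TEXT, Can_EMail TEXT)"
        )
        conn.executemany("INSERT INTO Can VALUES (?, ?, ?)", ROWS)
        return conn.execute(sql).fetchall()
    finally:
        conn.close()


# The original query: every grouping column must match, so no group has two rows.
all_three = run("""
    SELECT Can_FName, Can_HPhone, Can_EMail, COUNT(*)
    FROM Can
    GROUP BY Can_FName, Can_HPhone, Can_EMail
    HAVING COUNT(*) > 1
""")

# One GROUP BY per side of the OR, combined with UNION ALL.
either = run("""
    SELECT 'phone' AS matched_on, Can_FName, Can_HPhone AS value, COUNT(*) AS n
    FROM Can
    GROUP BY Can_FName, Can_HPhone
    HAVING COUNT(*) > 1
    UNION ALL
    SELECT 'email', Can_FName, Can_EMail, COUNT(*)
    FROM Can
    GROUP BY Can_FName, Can_EMail
    HAVING COUNT(*) > 1
    ORDER BY matched_on
""")

print(all_three)
print(either)
```

The UNION ALL form reports the phone match and the email match as two separate groups; it does not merge them into one group of three records.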

answered 2009-07-02T16:15:55.783

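I assume you also have a unique integer ID as the primary key on this table. If you don't, it's a good idea to have one, for this purpose and many others.

Find those duplicates with a self-join: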

select
  c1.ID 
, c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
, c2.ID 
, c2.Can_FName
, c2.Can_HPhone
, c2.Can_Email
from
(
  select 
      min(ID) as ID, -- alias needed so the outer query can reference c1.ID
      Can_FName, 
      Can_HPhone, 
      Can_Email 
  from Can 
  group by 
      Can_FName, 
      Can_HPhone, 
      Can_Email
) c1
inner join Can c2 on c1.ID < c2.ID 
where
    c1.Can_FName = c2.Can_FName 
and (c1.Can_HPhone = c2.Can_HPhone OR c1.Can_Email = c2.Can_Email)
order by
  c1.ID

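This query gives N-1 rows for each set of N duplicates. If you just want a count along with each unique combination, count the rows grouped by the "left" side: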

select count(1) + 1
, c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email
from 
(
  select 
      min(ID) as ID, -- alias needed so the join can reference c1.ID
      Can_FName, 
      Can_HPhone, 
      Can_Email 
  from Can 
  group by 
      Can_FName, 
      Can_HPhone, 
      Can_Email
) c1
inner join Can c2 on c1.ID < c2.ID 
where
    c1.Can_FName = c2.Can_FName 
and (c1.Can_HPhone = c2.Can_HPhone OR c1.Can_Email = c2.Can_Email)
group by 
  c1.Can_FName
, c1.Can_HPhone
, c1.Can_Email

Granted, this is more involved than a union, but I think it illustrates a good way of thinking about duplicates.

answered 2009-07-02T17:01:56.523

First project the desired transformation from a derived table, then do the aggregation:

SELECT COUNT(*) 
    , CAN_FName
    , Can_HPhoneOrEMail
    FROM (
        SELECT Can_FName 
            , ISNULL(Can_HPhone,'') +  ISNULL(Can_EMail,'')  AS Can_HPhoneOrEMail
        FROM Can) AS Can_Transformed
    GROUP BY Can_FName, Can_HPhoneOrEMail
    HAVING Count(*) > 1

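Adjust your "or" operation as needed in the derived table's projection list.

answered 2009-07-02T17:03:09.673

I know this answer will be criticised for using a temp table, but it will work anyway: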

-- create temp table to give the table a unique key
create table #tmp(
ID int identity,
can_Fname varchar(200) null, -- real type and len here
can_HPhone varchar(200) null, -- real type and len here
can_Email varchar(200) null -- real type and len here
)

-- just copy the rows where a duplicate fname exists
-- (better performance, especially for a big table)
insert into #tmp 
select can_fname,can_hphone,can_email
from Can 
where can_fname in (select can_fname from Can 
group by can_fname having count(*)>1)

-- select the rows that have the same fname and 
-- at least the same phone or email
select can_Fname, can_Hphone, can_Email  
from #tmp a where exists
(select * from #tmp b where
a.ID<>b.ID and A.can_fname = b.can_fname
and (isnull(a.can_HPhone,'')=isnull(b.can_HPhone,'')
or  isnull(a.can_email,'')=isnull(b.can_email,'')))
answered 2009-07-02T17:37:25.963

Try this:

SELECT Can_FName, COUNT(*)
FROM (
SELECT 
rank() over(partition by Can_FName order by  Can_FName,Can_HPhone) rnk_p,
rank() over(partition by Can_FName order by  Can_FName,Can_EMail) rnk_m,
Can_FName
FROM Can
) X
WHERE rnk_p=1 or rnk_m =1
GROUP BY Can_FName
HAVING COUNT(*)>1
answered 2009-07-02T19:47:46.903

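None of these answers is correct. Quassnoi's is a decent approach, but you will notice one fatal flaw in the expressions "qo.id > dup.id" and "di.chainid < do.chainid": comparisons made by ID! This is always bad practice because it depends on some inherent ordering in the IDs. IDs should never be given any implicit meaning and should only participate in equality or null tests. You can easily break Quassnoi's solution in this example simply by reordering the IDs in the data.

The essential problem is a disjunctive condition combined with grouping, which makes it possible for two records to be related through an intermediate record even though they are not directly related.

For example, you stated that these records should all be grouped together: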

(1) John 555-00-00 john@example.com

(2) John 555-00-01 john@example.com

(3) John 555-00-01 john-other@example.com

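You can see that #1 and #2 are related, as are #2 and #3, but clearly #1 and #3 are not directly related to each other; they belong to the same group only through #2.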
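The point about indirect relations can be checked with a few lines of Python. This is only an illustration of the rule from the question (same name AND (same phone OR same email)); the function name related and the RECORDS dictionary are made up for the example.

```python
from itertools import combinations

# The three records from the example above, keyed by their number.
RECORDS = {
    1: ("John", "555-00-00", "john@example.com"),
    2: ("John", "555-00-01", "john@example.com"),
    3: ("John", "555-00-01", "john-other@example.com"),
}


def related(a, b):
    """Same name AND (same phone OR same email)."""
    return a[0] == b[0] and (a[1] == b[1] or a[2] == b[2])


# Every pair of records that satisfies the rule directly.
direct_pairs = [
    (i, j)
    for i, j in combinations(sorted(RECORDS), 2)
    if related(RECORDS[i], RECORDS[j])
]

print(direct_pairs)  # (1, 3) is absent: those two only connect through record 2
```

Because the relation is not transitive, any query that tests pairs in a single pass will miss the link between #1 and #3.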

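This establishes that a recursive or iterative solution is the only possible solution.

Recursion, however, is not viable, since you can easily end up in a looping situation. This is what Quassnoi was trying to avoid with his ID comparisons, but in doing so he broke the algorithm. You could try limiting the levels of recursion, but then you may not complete all the relations, and you could still follow loops back onto yourself, leading to excessive data size and prohibitive inefficiency.

The best solution is iterative: start a result set by tagging each ID with its own unique group ID, then go through the result set and update it, combining IDs into the same group ID whenever they match on the disjunctive condition. Repeat the process on the updated set each time, until no further updates can be made.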
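As a rough sketch of the iterative idea described above (this is not the answerer's code, just one possible reading of it), the following Python keeps a group label per ID and merges labels until a full pass changes nothing. The names related and group_duplicates are assumptions made for this example, and the pairwise scan is quadratic per pass, so treat it as an illustration, not something tuned for a large table. IDs and labels are only ever compared for equality.

```python
def related(a, b):
    """Same name AND (same phone OR same email)."""
    return a[0] == b[0] and (a[1] == b[1] or a[2] == b[2])


def group_duplicates(records):
    """records maps id -> (name, phone, email); returns id -> group label."""
    group = {rid: rid for rid in records}  # each ID starts as its own group
    changed = True
    while changed:  # repeat until a full pass makes no update
        changed = False
        ids = list(records)
        for i, a in enumerate(ids):
            for b in ids[i + 1:]:
                if group[a] != group[b] and related(records[a], records[b]):
                    old, new = group[b], group[a]
                    # Relabel the whole of b's group so that the merge is transitive.
                    for rid in group:
                        if group[rid] == old:
                            group[rid] = new
                    changed = True
    return group


RECORDS = {
    1: ("John", "555-00-00", "john@example.com"),
    2: ("John", "555-00-01", "john@example.com"),
    3: ("John", "555-00-01", "john-other@example.com"),
    4: ("James", "555-00-00", "james@example.com"),
    5: ("James", "555-00-01", "james-other@example.com"),
}

groups = group_duplicates(RECORDS)
print(groups)
```

With the sample rows used earlier in this thread, the three John records end up sharing one label and each James record keeps its own. Renumbering the IDs changes which label a group ends up with, but not which records are grouped together.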

I will create example code for this soon.

answered 2010-10-19T14:50:57.323