
Suppose I am creating an address book in which a main table holds basic contact information and a child table holds phone numbers:

Contact
===============
Id         [PK]
Name

PhoneNumber
===============
Id         [PK]
Contact_Id [FK]
Number

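So a Contact record may have zero or more related records in the PhoneNumber table. There are no uniqueness constraints on any column other than the primary keys. In fact, this must be the case, because:

  1. Two contacts with different names may share a phone number, and
  2. Two contacts may have the same name but different phone numbers.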

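I want to import a large data set that may contain duplicate records into my database and then filter out the duplicates using SQL. The rule for identifying duplicates is simple: two records are duplicates when they share the same name and have the same number of phone records with the same content.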
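
To make the rule concrete, here is a minimal Python sketch of the comparison written as a plain predicate. The dictionary layout and the name is_duplicate are made up for illustration, and the sketch assumes that a number listed twice on one contact counts twice.

```python
from collections import Counter


def is_duplicate(a, b):
    """Return True when two contacts match under the rule described above."""
    same_name = a["name"] == b["name"]
    # Counter compares the phone numbers as a multiset, so order is ignored
    # but both the count and the content of the records must agree.
    same_numbers = Counter(a["numbers"]) == Counter(b["numbers"])
    return same_name and same_numbers


john_a = {"name": "John Smith", "numbers": ["555-1212", "222-1515"]}
john_b = {"name": "John Smith", "numbers": ["222-1515", "555-1212"]}
john_c = {"name": "John Smith", "numbers": ["111-2525"]}
jane = {"name": "Jane Smith", "numbers": ["111-2525"]}

print(is_duplicate(john_a, john_b))  # True: same name, same numbers
print(is_duplicate(john_a, john_c))  # False: same name, different numbers
print(is_duplicate(john_c, jane))    # False: same number, different names
```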

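Of course, the following query works well for selecting duplicates from the Contact table by name, but it does not help me detect actual duplicates according to my rule: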

SELECT * FROM Contact
WHERE EXISTS
    (SELECT 'x' FROM Contact t2 
     WHERE t2.Name = Contact.Name AND
           t2.Id > Contact.Id);

It seems that what I want is a logical extension of what I already have, but I must be overlooking it. Any help?

Thanks!


3 Answers


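In my question I created a greatly simplified schema that mirrors the real-world problem I am solving. Przemyslaw's answer is indeed correct and does what I asked, both with the sample schema and, once extended, with the real one.

However, after some experiments with the real schema and a larger data set (about 10,000 records), I found performance to be a problem. I don't claim to be an indexing expert, but I couldn't find a better combination of indexes than the one the schema already has.

So I came up with an alternative solution that meets the same requirements but runs in a fraction of the time (< 10%), at least with SQLite3, my production engine. In the hope that it helps someone else, I am offering it as an alternative answer to my question.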

BEGIN TRANSACTION;  -- opens the transaction that the COMMIT below closes

DROP TABLE IF EXISTS Contact;
DROP TABLE IF EXISTS PhoneNumber;

CREATE TABLE Contact (
  Id    INTEGER PRIMARY KEY,
  Name  TEXT
);

CREATE TABLE PhoneNumber (
  Id          INTEGER PRIMARY KEY,
  Contact_Id  INTEGER REFERENCES Contact (Id) ON UPDATE CASCADE ON DELETE CASCADE,
  Number      TEXT
);

INSERT INTO Contact (Id, Name) VALUES
  (1, 'John Smith'),
  (2, 'John Smith'),
  (3, 'John Smith'),
  (4, 'Jane Smith'),
  (5, 'Bob Smith'),
  (6, 'Bob Smith');

INSERT INTO PhoneNumber (Id, Contact_Id, Number) VALUES
  (1, 1, '555-1212'),
  (2, 1, '222-1515'),
  (3, 2, '222-1515'),
  (4, 2, '555-1212'),
  (5, 3, '111-2525'),
  (6, 4, '111-2525');

COMMIT;

SELECT *
FROM Contact c1
WHERE EXISTS (
  SELECT 1
  FROM Contact c2
  WHERE c2.Id > c1.Id
    AND c2.Name = c1.Name
    AND (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c2.Id) = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id)
    AND (
      SELECT COUNT(*)
      FROM PhoneNumber p1
      WHERE p1.Contact_Id = c2.Id
        AND EXISTS (
          SELECT 1
          FROM PhoneNumber p2
          WHERE p2.Contact_Id = c1.Id
            AND p2.Number = p1.Number
        )
    ) = (SELECT COUNT(*) FROM PhoneNumber WHERE Contact_Id = c1.Id)
)
;

The result is as expected:

Id     Name
====== =============
1      John Smith
5      Bob Smith

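Other engines will certainly have different performance characteristics, and theirs may be perfectly acceptable. For this schema, this solution seems to work well with SQLite.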
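
As a point of comparison, the same rule can also be sketched outside SQL. The following is an illustrative Python sketch using the standard sqlite3 module, not the approach measured above; the variable names are made up for the example. It builds a (name, sorted numbers) signature per contact and, like the query above, reports every contact except the one with the highest Id in each group of duplicates.

```python
import sqlite3
from collections import defaultdict

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (Id INTEGER PRIMARY KEY, Name TEXT);
CREATE TABLE PhoneNumber (
  Id INTEGER PRIMARY KEY,
  Contact_Id INTEGER REFERENCES Contact (Id),
  Number TEXT
);
INSERT INTO Contact (Id, Name) VALUES
  (1, 'John Smith'), (2, 'John Smith'), (3, 'John Smith'),
  (4, 'Jane Smith'), (5, 'Bob Smith'), (6, 'Bob Smith');
INSERT INTO PhoneNumber (Id, Contact_Id, Number) VALUES
  (1, 1, '555-1212'), (2, 1, '222-1515'), (3, 2, '222-1515'),
  (4, 2, '555-1212'), (5, 3, '111-2525'), (6, 4, '111-2525');
""")

# One pass over the joined rows; LEFT JOIN keeps contacts without numbers.
rows = conn.execute("""
    SELECT c.Id, c.Name, p.Number
    FROM Contact c
    LEFT JOIN PhoneNumber p ON p.Contact_Id = c.Id
""").fetchall()

names = {}
numbers = defaultdict(list)
for contact_id, name, number in rows:
    names[contact_id] = name
    if number is not None:
        numbers[contact_id].append(number)

# Contacts with the same name and the same sorted numbers share a signature.
groups = defaultdict(list)
for contact_id, name in names.items():
    signature = (name, tuple(sorted(numbers[contact_id])))
    groups[signature].append(contact_id)

# Keep the highest Id of each group and report the rest as duplicates.
duplicate_ids = sorted(
    cid for ids in groups.values() for cid in sorted(ids)[:-1]
)
print(duplicate_ids)  # [1, 5]
```

One difference in behaviour: the sketch compares numbers as a multiset, so a contact that lists one number twice only matches another contact that also lists it twice, whereas the EXISTS-based query above would match a contact holding (A, A) against one holding (A, B).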

answered 2013-10-02T15:45:25.027

The keyword "having" is your friend. The general usage is:

select field1, field2, count(*) records
from whereever
where whatever
group by field1, field2
having records > 1

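Whether you can use an alias in the having clause depends on the database engine. You should be able to apply this basic principle to your situation.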
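
As a small illustration of that pattern, here is a sketch run against SQLite through Python's standard sqlite3 module. The table contents are sample data chosen for the example, and the aggregate is repeated in the having clause instead of the alias so that the statement does not depend on alias support.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Contact (Id INTEGER PRIMARY KEY, Name TEXT);
INSERT INTO Contact (Id, Name) VALUES
  (1, 'John Smith'), (2, 'John Smith'), (3, 'John Smith'),
  (4, 'Jane Smith'), (5, 'Bob Smith'), (6, 'Bob Smith');
""")

# Group by the column that defines "the same" and keep groups larger than one.
repeated_names = conn.execute("""
    SELECT Name, COUNT(*) AS records
    FROM Contact
    GROUP BY Name
    HAVING COUNT(*) > 1
    ORDER BY Name
""").fetchall()
print(repeated_names)  # [('Bob Smith', 2), ('John Smith', 3)]
```

On its own this only finds repeated names; the phone-number comparison from the question still has to be layered on top.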

answered 2013-09-30T19:43:35.267

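The author stated the requirement for "two people being the same person" as:

  1. having the same name, and
  2. having the same number of phone numbers, all of which are the same.

So the problem is a bit more complex than it seems (or maybe I'm just overthinking it).

Below are sample data and a sample query (an ugly one, I know, but the general idea is there). I ran the query against this test data and it seems to work correctly (I'm using Oracle 11g R2):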

CREATE TABLE contact (
  id NUMBER PRIMARY KEY,
  name VARCHAR2(40))
;

CREATE TABLE phone_number (
  id NUMBER PRIMARY KEY,
  contact_id REFERENCES contact (id),
  phone VARCHAR2(10)
);

INSERT INTO contact (id, name) VALUES (1, 'John');
INSERT INTO contact (id, name) VALUES (2, 'John');
INSERT INTO contact (id, name) VALUES (3, 'Peter');
INSERT INTO contact (id, name) VALUES (4, 'Peter');
INSERT INTO contact (id, name) VALUES (5, 'Mike');
INSERT INTO contact (id, name) VALUES (6, 'Mike');
INSERT INTO contact (id, name) VALUES (7, 'Mike');

INSERT INTO phone_number (id, contact_id, phone) VALUES (1, 1, '123'); -- John having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (2, 1, '456'); -- John having number 456

INSERT INTO phone_number (id, contact_id, phone) VALUES (3, 2, '123'); -- John the second having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (4, 2, '456'); -- John the second having number 456

INSERT INTO phone_number (id, contact_id, phone) VALUES (5, 3, '123'); -- Peter having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (6, 3, '456'); -- Peter having number 456
INSERT INTO phone_number (id, contact_id, phone) VALUES (7, 3, '789'); -- Peter having number 789

INSERT INTO phone_number (id, contact_id, phone) VALUES (8, 4, '456'); -- Peter the second having number 456

INSERT INTO phone_number (id, contact_id, phone) VALUES (9, 5, '123'); -- Mike having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (10, 5, '456'); -- Mike having number 456

INSERT INTO phone_number (id, contact_id, phone) VALUES (11, 6, '123'); -- Mike the second having number 123
INSERT INTO phone_number (id, contact_id, phone) VALUES (12, 6, '789'); -- Mike the second having number 789

-- Mike the third having no number
COMMIT;

-- does not meet the requirements described in the question - will return Peter when it should not
SELECT DISTINCT c.name
  FROM contact c JOIN phone_number pn ON (pn.contact_id = c.id)
GROUP BY c.name, pn.phone
HAVING COUNT(c.id) > 1
;

-- returns correct results for provided test data
-- take all people that have a namesake in contact table and
-- take all this person's phone numbers that this person's namesake also has
-- finally (outer query) check that the number of both persons' phone numbers is the same and
-- the number of the same phone numbers is equal to the number of (either) person's phone numbers
SELECT c1_id, name
  FROM (
    SELECT c1.id AS c1_id, c1.name, c2.id AS c2_id, COUNT(1) AS cnt
      FROM contact c1
        JOIN contact c2 ON (c2.id != c1.id AND c2.name = c1.name)
        JOIN phone_number pn ON (pn.contact_id = c1.id)
    WHERE
      EXISTS (SELECT 1
                FROM phone_number
              WHERE contact_id = c2.id
                AND phone = pn.phone)
    GROUP BY c1.id, c1.name, c2.id
  )
WHERE cnt = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id)
  AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c2_id)
;

-- cleanup
DROP TABLE phone_number;
DROP TABLE contact;

See the SQL Fiddle: http://www.sqlfiddle.com/#!4/36cdf/1

Edited

In response to the author's comment: of course, I didn't take that into account... here is a revised solution:

-- new test data
INSERT INTO contact (id, name) VALUES (8, 'Jane');
INSERT INTO contact (id, name) VALUES (9, 'Jane');

SELECT c1_id, name
  FROM (
    SELECT c1.id AS c1_id, c1.name, c2.id AS c2_id, COUNT(1) AS cnt
      FROM contact c1
        JOIN contact c2 ON (c2.id != c1.id AND c2.name = c1.name)
        LEFT JOIN phone_number pn ON (pn.contact_id = c1.id)
    WHERE pn.contact_id IS NULL
      OR EXISTS (SELECT 1
                FROM phone_number
              WHERE contact_id = c2.id
                AND phone = pn.phone)
    GROUP BY c1.id, c1.name, c2.id
  )
WHERE (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) IN (0, cnt)
  AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = c1_id) = (SELECT COUNT(1) FROM phone_number WHERE contact_id = c2_id)
;

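We now allow for contacts with no phone numbers (LEFT JOIN), and in the outer query we compare the person's count of phone numbers: it must be either 0 or the number returned by the inner query.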
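
For readers who want to try the revised query without an Oracle instance, here is a hedged port to SQLite, run through Python's standard sqlite3 module. The port is an assumption of this sketch and not part of the tested Oracle solution: the column types are changed to SQLite ones, the inline views get the aliases d and x, and an ORDER BY is added so the output order is stable.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE contact (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE phone_number (
  id INTEGER PRIMARY KEY,
  contact_id INTEGER REFERENCES contact (id),
  phone TEXT
);
INSERT INTO contact (id, name) VALUES
  (1, 'John'), (2, 'John'), (3, 'Peter'), (4, 'Peter'),
  (5, 'Mike'), (6, 'Mike'), (7, 'Mike'), (8, 'Jane'), (9, 'Jane');
INSERT INTO phone_number (id, contact_id, phone) VALUES
  (1, 1, '123'), (2, 1, '456'),
  (3, 2, '123'), (4, 2, '456'),
  (5, 3, '123'), (6, 3, '456'), (7, 3, '789'),
  (8, 4, '456'),
  (9, 5, '123'), (10, 5, '456'),
  (11, 6, '123'), (12, 6, '789');
""")

# Same structure as the revised query above, with explicit aliases.
REVISED_QUERY = """
SELECT d.c1_id, d.name
FROM (
    SELECT c1.id AS c1_id, c1.name AS name, c2.id AS c2_id, COUNT(1) AS cnt
    FROM contact c1
    JOIN contact c2 ON c2.id != c1.id AND c2.name = c1.name
    LEFT JOIN phone_number pn ON pn.contact_id = c1.id
    WHERE pn.contact_id IS NULL
       OR EXISTS (SELECT 1 FROM phone_number x
                  WHERE x.contact_id = c2.id AND x.phone = pn.phone)
    GROUP BY c1.id, c1.name, c2.id
) d
WHERE (SELECT COUNT(1) FROM phone_number WHERE contact_id = d.c1_id) IN (0, d.cnt)
  AND (SELECT COUNT(1) FROM phone_number WHERE contact_id = d.c1_id)
    = (SELECT COUNT(1) FROM phone_number WHERE contact_id = d.c2_id)
ORDER BY d.c1_id
"""

result = conn.execute(REVISED_QUERY).fetchall()
print(result)  # [(1, 'John'), (2, 'John'), (8, 'Jane'), (9, 'Jane')]
```

Because the join condition uses != and not >, both members of each duplicate pair are returned.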

answered 2013-09-30T20:43:11.333