I created a table with the following fields:

Record:
Id                 int Primary Key, Auto Increment
ForeignId          int
IsDuplicateRecord bit NULL
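
For anyone reproducing this, a table matching that description could be declared roughly as follows. This is only a sketch in T-SQL: the IDENTITY(0, 1) seed is an assumption, chosen so that the generated Ids start at 0 as they do in the tables further down.

```sql
-- Sketch only: IDENTITY(0, 1) is assumed so that Ids start at 0,
-- matching the result tables shown in the question.
CREATE TABLE Record (
    Id                INT IDENTITY(0, 1) PRIMARY KEY,
    ForeignId         INT,
    IsDuplicateRecord BIT NULL
);
```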

Then I inserted some data:

INSERT INTO Record (ForeignId)
VALUES (5), (5), (1), (2), (3)

After that, I ran the following update statement (found at http://archive.msdn.microsoft.com/SQLExamples/Wiki/View.aspx?title=DuplicateRows ):

UPDATE Record
SET IsDuplicateRecord = 1
WHERE Id IN (
    SELECT MAX(Id)
    FROM Record
    GROUP BY ForeignId
    HAVING COUNT(*) > 1
)

So far so good, the query affected one row, and the table now looks like this:

Id ForeignId IsDuplicateRecord
0  5         NULL
1  5         1
2  1         NULL
3  2         NULL
4  3         NULL

I was happy, because for a moment I thought everything was going to be just fine. But then a suspicion as dark as the clouds outside crossed my mind: Dreadingly, I typed

INSERT INTO Record (ForeignId)
VALUES (1), (1)

and ran the above query again, which this time yielded:

Id  ForeignId  IsDuplicateRecord
0   5          NULL
1   5          1
2   1          NULL
3   2          NULL
4   3          NULL
5   1          NULL
6   1          1

So I figured I'd head over to StackOverflow and see who could explain to me why the IsDuplicateRecord field in the row with Id 5 wasn't updated to 1. Are you the one?

3 Answers

Because the SQL you ran only marks the last of the duplicates as duplicates. Try this instead:

UPDATE Record
SET IsDuplicateRecord = 1
WHERE Id NOT IN (
    SELECT MIN(Id)
    FROM Record
    GROUP BY ForeignId
)

This marks the second and subsequent occurrences of each ForeignId as duplicates, which I think is what you require.
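
To see the difference without changing any data, the two subqueries can be run on their own. The sketch below assumes the seven rows shown in the question; the column aliases FlaggedId, RowsInGroup and KeptId are made up for illustration.

```sql
-- What the original query targets: only the highest Id of each
-- ForeignId that occurs more than once.
SELECT ForeignId, MAX(Id) AS FlaggedId, COUNT(*) AS RowsInGroup
FROM Record
GROUP BY ForeignId
HAVING COUNT(*) > 1;

-- What the rewritten query protects: the lowest Id of every ForeignId.
-- Every row whose Id is not in this list gets flagged.
SELECT ForeignId, MIN(Id) AS KeptId
FROM Record
GROUP BY ForeignId;
```

With the question's data, the first query returns Id 1 for ForeignId 5 and Id 6 for ForeignId 1, which is why Id 5 was never touched. The second query keeps Ids 0, 2, 3 and 4, so Ids 1, 5 and 6 are all flagged.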

Answered 2012-07-12T16:09:23.797

UPDATE Record uu
SET IsDuplicateRecord = 1
   -- if there exists a record with the same foreignid
   -- but a lower id
   -- this (uu) is a duplicate
WHERE EXISTS (
    SELECT *
    FROM Record ex 
    WHERE ex.ForeignId = uu.ForeignId
    AND ex.Id < uu.Id
    );

There is a subtle (but rude) difference between this EXISTS (...) subquery and @DavidM's NOT IN (...) subquery when "ForeignId" can be NULL. GROUP BY puts all NULLs into a single group, so the NOT IN subquery yields just one MIN(Id) for them, and the NOT IN version sets the IsDuplicateRecord flag for every other tuple with ForeignId IS NULL. In the EXISTS version, ex.ForeignId = uu.ForeignId is never true for NULLs, so those tuples are left alone. (I suspect ForeignId is a FK, so it could well be NULLable.)

For a non-nullable ForeignId, the two versions are basically the same.

UPDATE: as @MartinSmith pointed out, some implementations don't accept an UPDATE ... WHERE without a FROM clause. We can use a self-joined dummy. (Also updated the first query to the normal form.)

-- DROP SCHEMA tmp CASCADE;
-- CREATE SCHEMA tmp ;
-- SET search_path='tmp';

DROP TABLE IF EXISTS zrecord CASCADE;
CREATE TABLE zrecord
        ( id SERIAL NOT NULL PRIMARY KEY
        , foreign_id INTEGER -- REFERENCES zrecord(id)
        , is_duplicate boolean DEFAULT False
        );
SELECT * FROM zrecord;

INSERT INTO zrecord(foreign_id) VALUES(NULL),(1),(NULL),(1),(NULL),(2),(NULL);

SELECT * FROM zrecord;

EXPLAIN ANALYZE
UPDATE zrecord uu
SET is_duplicate = True
        --
        -- This selfjoin is needed if UPDATE ... WHERE needs a FROM TABLE
        --
FROM zrecord dum
WHERE  dum.id = uu.id
AND EXISTS (
    SELECT *
    FROM zrecord ex
    WHERE ex.foreign_id = uu.foreign_id
    AND ex.Id < uu.Id
    );

SELECT * FROM zrecord;
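
The nullability difference described above can also be checked side by side, without updating anything. This is a sketch for PostgreSQL against the zrecord test set; the column names flagged_by_exists and flagged_by_not_in are made up for illustration.

```sql
-- Evaluate both predicates for every row of the test set.
SELECT z.id
     , z.foreign_id
     , EXISTS (
           SELECT 1
           FROM zrecord ex
           WHERE ex.foreign_id = z.foreign_id   -- never true when foreign_id IS NULL
           AND ex.id < z.id
       ) AS flagged_by_exists
     , z.id NOT IN (
           SELECT MIN(id)
           FROM zrecord
           GROUP BY foreign_id                  -- all NULLs land in one group
       ) AS flagged_by_not_in
FROM zrecord z
ORDER BY z.id;
```

On the seven test rows, flagged_by_exists is true only for id 4, while flagged_by_not_in is true for ids 3, 4, 5 and 7: the NOT IN form treats the NULL rows as duplicates of each other.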

UPDATE2: the PARTITION BY suffers from the same nullability problem as the IN clause, so it seems:

WITH zcte AS (
    SELECT *
    , row_number() OVER (PARTITION BY foreign_id ORDER BY id) AS rn
    FROM   zrecord
    )
SELECT * FROM zcte;

RESULT: (the test set above, after the EXISTS update)

 id | foreign_id | is_duplicate | rn 
----+------------+--------------+----
  2 |          1 | f            |  1
  4 |          1 | t            |  2
  6 |          2 | f            |  1
  1 |            | f            |  1
  3 |            | f            |  2
  5 |            | f            |  3
  7 |            | f            |  4
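
If rows with a NULL foreign_id should not count as duplicates of one another (an assumption about intent, not something stated in the question), the row_number() approach can exclude them explicitly. A minimal sketch for PostgreSQL, which joins the CTE back to the table instead of updating the CTE itself:

```sql
WITH zcte AS (
    SELECT id
         , foreign_id
         , row_number() OVER (PARTITION BY foreign_id ORDER BY id) AS rn
    FROM zrecord
    )
UPDATE zrecord uu
SET is_duplicate = True
FROM zcte
WHERE zcte.id = uu.id
AND zcte.rn > 1                      -- second and later rows of each partition
AND zcte.foreign_id IS NOT NULL;     -- skip the partition that collects the NULLs
```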

Answered 2012-07-12T17:47:10.100

This has a lower estimated cost than either of the other two answers.

;WITH CTE
     AS (SELECT *,
                Row_number() OVER (PARTITION BY ForeignId ORDER BY Id) AS RN
         FROM   Record)
UPDATE CTE
SET    IsDuplicateRecord = 1
WHERE  RN > 1 
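
One way to compare estimated costs yourself in SQL Server is SET SHOWPLAN_ALL, which returns the estimated plan for each statement instead of executing it. The sketch below assumes it is run from a client that understands the GO batch separator (such as SSMS or sqlcmd). The estimates depend on your data volume, indexes and statistics, so results may differ from one system to another.

```sql
-- Return estimated plans instead of executing the statements.
-- SET SHOWPLAN_ALL must be the only statement in its batch.
SET SHOWPLAN_ALL ON;
GO

-- Candidate 1: the NOT IN (SELECT MIN(Id) ...) form
UPDATE Record
SET    IsDuplicateRecord = 1
WHERE  Id NOT IN (SELECT MIN(Id) FROM Record GROUP BY ForeignId);

-- Candidate 2: the ROW_NUMBER() form
WITH CTE
     AS (SELECT *,
                ROW_NUMBER() OVER (PARTITION BY ForeignId ORDER BY Id) AS RN
         FROM   Record)
UPDATE CTE
SET    IsDuplicateRecord = 1
WHERE  RN > 1;
GO

SET SHOWPLAN_ALL OFF;
GO
```

The TotalSubtreeCost value on the first row of each statement's plan is the figure to compare. Nothing is modified while SHOWPLAN_ALL is on.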

Execution Plans: [image not preserved]

Answered 2012-07-15T12:05:38.910