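I have a table taccounts with columns like account_id (PK), login_name, password, last_login. Now I have to remove some duplicate entries according to new business logic. Duplicate accounts are those with either the same email or the same (login_name & password). The account with the latest login must be preserved.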

Here are my attempts (some email values are null and some are blank):

DELETE
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 and last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
WHERE email is not null and char_length(trim(both ' ' from email))>0 
GROUP BY lower(trim(both ' ' from email)))

Similarly for login_name and password:

DELETE
FROM taccounts
WHERE last_login NOT IN
(
SELECT MAX(last_login)
FROM taccounts
GROUP BY login_name, password)

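Is there a better way, or any way to combine these two separate queries?

There are also some other tables that have account_id as a foreign key. How do I update those tables for this change? I am using PostgreSQL 9.2.1.

EDIT: Some of the email values are null and some of them are blank (''). So if two accounts have different login_name & password and their emails are null or blank, they must be treated as two different accounts.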

2 Answers

If most of the rows are going to be deleted (mostly dupes) and the table fits into RAM, consider this route:

  1. SELECT the surviving rows into a temporary table.
  2. Reroute FK references to the survivors.
  3. DELETE all rows from the base table.
  4. Re-INSERT the survivors.

1a. Distill the surviving rows

CREATE TEMP TABLE tmp AS
SELECT DISTINCT ON (login_name, password) *
FROM  (
   SELECT DISTINCT ON (email) *
   FROM   taccounts
   ORDER  BY email, last_login DESC
   ) sub
ORDER  BY login_name, password, last_login DESC;

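About DISTINCT ON: it keeps only the first row of each group, as determined by ORDER BY.

To identify duplicates under two different criteria, use a subquery to apply the two rules one after the other. The first step preserves the account with the latest last_login, so the two steps can safely be applied in sequence (this is "serializable").

Inspect the results and test them for plausibility.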

SELECT * FROM tmp;

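Temporary tables are dropped automatically at the end of a session. In pgAdmin (which you seem to be using), the session lives as long as the editor window is open.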
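To see how the two nested DISTINCT ON steps from 1a interact, here is a minimal sketch on three made-up rows. The sample data is purely hypothetical; only the column names come from the question.

```sql
-- Hypothetical sample rows standing in for taccounts.
WITH sample(account_id, email, login_name, password, last_login) AS (
   VALUES (1, 'a@x.tld', 'joe', 'pw1', timestamp '2013-03-01 10:00')
        , (2, 'a@x.tld', 'bob', 'pw2', timestamp '2013-03-02 10:00')  -- same email as 1, newer
        , (3, 'c@x.tld', 'bob', 'pw2', timestamp '2013-03-03 10:00')  -- same login/password as 2, newer
   )
SELECT DISTINCT ON (login_name, password) *
FROM  (
   SELECT DISTINCT ON (email) *          -- step 1: newest row per email (keeps 2 and 3)
   FROM   sample
   ORDER  BY email, last_login DESC
   ) sub
ORDER  BY login_name, password, last_login DESC;  -- step 2: newest row per login/password (keeps 3)
```

Only the row with account_id 3 survives. Note that DISTINCT ON puts all rows with a NULL email into one group, so accounts without an email would collapse into a single row here. The query in 1b avoids that.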

1b. Alternative query for the updated definition of "duplicate"

SELECT *
FROM   taccounts t
WHERE  NOT EXISTS (
   SELECT 1 FROM taccounts t1   -- an empty SELECT list requires PostgreSQL 9.4+; the question uses 9.2.1
   WHERE  ( NULLIF(t1.email, '') = t.email
        OR (NULLIF(t1.login_name, ''), NULLIF(t1.password, '')) = (t.login_name, t.password))
   AND   (t1.last_login, t1.account_id) > (t.last_login, t.account_id)
   );

This does not treat NULL or the empty string ('') as identical in any of the "duplicate" columns.
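A small sketch of why NULLIF on one side of the comparison is enough. The literal values are made up for illustration:

```sql
SELECT NULLIF(''::text, '')                    AS blank_becomes_null  -- NULL
     , NULLIF('a@x.tld'::text, '')             AS value_kept          -- a@x.tld
     , NULLIF(''::text, '') = ''               AS blank_vs_blank      -- NULL, so never a match
     , NULLIF('a@x.tld'::text, '') = 'a@x.tld' AS same_email;         -- true
```

Because a comparison with NULL never evaluates to true, a blank or NULL value on either side cannot make two rows count as duplicates.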

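The row expression (t1.last_login, t1.account_id) takes care of the possibility that two dupes share the same last_login. In that case the one with the bigger account_id is chosen, which is unambiguous since account_id is the PK.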
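Row comparison works left to right, so account_id only matters when last_login is equal. A quick illustration with made-up values:

```sql
SELECT (timestamp '2013-03-30 14:00', 2) > (timestamp '2013-03-30 14:00', 1) AS tie_broken_by_id  -- true
     , (timestamp '2013-03-30 14:00', 1) > (timestamp '2013-03-30 15:00', 9) AS older_login;      -- false
```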

2a. How to identify all incoming FKs

SELECT c.confrelid::regclass::text AS referenced_table
     , c.conname AS fk_name
     , pg_get_constraintdef(c.oid) AS fk_definition
FROM   pg_attribute a 
JOIN   pg_constraint c ON (c.conrelid, c.conkey[1]) = (a.attrelid, a.attnum)
WHERE  c.confrelid = 'taccounts'::regclass   -- (schema-qualified) table name
AND    c.contype  = 'f'
ORDER  BY 1, contype DESC;

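This only builds on the first column of each foreign key.

Alternatively, select the table taccounts in the object browser of pgAdmin and inspect the Dependents tab in the right-hand pane.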
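For the rerouting in the next step it also helps to know which table and column hold each foreign key. The following variant of the catalog query is a sketch under the same single-column assumption; the output column names are my own choice, not part of the original answer.

```sql
SELECT c.conrelid::regclass::text AS referencing_table
     , a.attname                  AS referencing_column
     , c.conname                  AS fk_name
FROM   pg_constraint c
JOIN   pg_attribute  a ON a.attrelid = c.conrelid
                      AND a.attnum   = c.conkey[1]   -- first FK column only
WHERE  c.confrelid = 'taccounts'::regclass           -- (schema-qualified) table name
AND    c.contype   = 'f'
ORDER  BY 1, 2;
```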

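2b. Reroute to the new master

If you have tables referencing taccounts (incoming foreign keys to taccounts), you will want to update all those fields before you delete the dupes.
Reroute all of them to the new master row: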

-- referencing_tbl / referencing_column are placeholders for each table found in 2a;
-- reference_column is the referenced column in taccounts (account_id here).
UPDATE referencing_tbl r
SET    referencing_column = tmp.reference_column
FROM   tmp
JOIN   taccounts t1 USING (email)
WHERE  r.referencing_column = t1.reference_column
AND    r.referencing_column IS DISTINCT FROM tmp.reference_column;

UPDATE referencing_tbl r
SET    referencing_column = tmp.reference_column
FROM   tmp
JOIN   taccounts t2 USING (login_name, password)
WHERE  r.referencing_column = t2.reference_column
AND    r.referencing_column IS DISTINCT FROM tmp.reference_column;

3. & 4. Go in for the kill

Now the dupes are not referenced any more. Go in for the kill.
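Before deleting anything, it may be worth confirming that no referencing row still points at a non-survivor. The sketch below is self-contained: the two CTEs are hypothetical stand-ins that shadow tmp and the referencing_tbl placeholder from 2b. Drop the WITH clause to run the check against the real tables.

```sql
WITH tmp(account_id) AS (
        VALUES (3)                         -- hypothetical survivor
     )
   , referencing_tbl(id, referencing_column) AS (
        VALUES (10, 3)                     -- already points at a survivor
             , (11, 2)                     -- still points at a dupe
     )
SELECT r.*
FROM   referencing_tbl r
WHERE  r.referencing_column IS NOT NULL
AND    NOT EXISTS (
   SELECT 1 FROM tmp WHERE tmp.account_id = r.referencing_column
   );
```

With the sample data this returns the row with id 11. Against real tables, an empty result means every reference has been rerouted.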

ALTER TABLE taccounts DISABLE TRIGGER ALL;
DELETE FROM taccounts;
VACUUM taccounts;
INSERT INTO taccounts
SELECT * FROM tmp;
ALTER TABLE taccounts ENABLE TRIGGER ALL;

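Disable all triggers for the duration of the operation. This avoids checking referential integrity while it runs. Everything should be fine once you re-enable the triggers: we took care of all incoming FKs above, and outgoing FKs are guaranteed to be sound, since there is no concurrent write access and all values were there before.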

answered 2013-03-30T11:14:17.617

In addition to Erwin's excellent answer, it is often useful to create an intermediate link table that relates the old keys to the new ones.

DROP SCHEMA tmp CASCADE;
CREATE SCHEMA tmp ;
SET search_path=tmp;
CREATE TABLE taccounts
        ( account_id SERIAL PRIMARY KEY
        , login_name varchar
        , email varchar
        , last_login TIMESTAMP
        );
    -- create some fake data
INSERT INTO taccounts(last_login)
SELECT gs FROM generate_series('2013-03-30 14:00:00' ,'2013-03-30 15:00:00' , '1min'::interval) gs
        ;
UPDATE taccounts
SET login_name = 'User_' || (account_id %10)::text
        , email = 'Joe' || (account_id %9)::text || '@somedomain.tld'
        ;

SELECT * FROM taccounts;

        --
        -- Create (temp) table linking old id <--> new id
        -- After inspection this table can be used as a source for the FK updates
        -- and for the final delete.
        --
CREATE TABLE update_ids AS
WITH pairs AS (
        SELECT one.account_id AS old_id
        , two.account_id AS new_id
        FROM taccounts one
        JOIN taccounts two ON two.last_login > one.last_login
                AND ( two.email = one.email OR two.login_name = one.login_name)
        )
SELECT old_id,new_id
FROM pairs pp
WHERE NOT EXISTS (
        SELECT * FROM pairs nx
        WHERE nx.old_id = pp.old_id
        AND nx.new_id > pp.new_id
        )
        ;

SELECT * FROM update_ids
        ;

UPDATE other_table_with_fk_to_taccounts dst
SET account_id = ids.new_id
FROM update_ids ids
WHERE dst.account_id = ids.old_id
        ;
DELETE FROM taccounts del
WHERE EXISTS (
        SELECT * FROM update_ids ex
        WHERE ex.old_id = del.account_id
        );

SELECT * FROM taccounts;

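Another way to accomplish the same thing is to add a column to the table itself that points to the preferred key, and use that for the updates and deletes.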

ALTER TABLE taccounts
        ADD COLUMN better_id INTEGER REFERENCES taccounts(account_id)
        ;

   -- find the *better* records for each record.
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.last_login > dst.last_login
AND src.email IS NOT NULL
AND NOT EXISTS (
        SELECT * FROM taccounts nx
        WHERE nx.login_name = dst.login_name
        AND nx.email IS NOT NULL
        AND nx.last_login > src.last_login
        );

    -- Find records that *do* have an email address
UPDATE taccounts dst
SET better_id = src.account_id
FROM taccounts src
WHERE src.login_name = dst.login_name
AND src.email IS NOT NULL
AND dst.email IS NULL
AND NOT EXISTS (
        SELECT * FROM taccounts nx
        WHERE nx.login_name = dst.login_name
        AND nx.email IS NOT NULL
        AND nx.last_login > src.last_login
        );

SELECT * FROM taccounts ORDER BY account_id;

UPDATE other_table_with_fk_to_taccounts dst
SET account_id = src.better_id
FROM taccounts src   -- better_id lives in taccounts, not in update_ids
WHERE dst.account_id = src.account_id
AND src.better_id IS NOT NULL
        ;

DELETE FROM taccounts del
WHERE EXISTS (
        SELECT * FROM taccounts ex
        WHERE ex.account_id = del.better_id
        );
SELECT * FROM taccounts ORDER BY account_id;
answered 2013-03-30T13:17:39.827