postgresql - Removing duplicate rows from a large PostgreSQL table

Question

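I have a 100 GB PostgreSQL database. One of its tables has about a billion entries. To load the data quickly, some rows were inserted in duplicate and left to be pruned later. One of the columns can be used to identify a row as unique.

I found this Stack Overflow question, which suggests a solution for MySQL: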

ALTER IGNORE TABLE table_name ADD UNIQUE (location_id, datetime)

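Is there anything similar for PostgreSQL?

I tried deleting with GROUP BY and with row_number(); in both cases my computer ran out of memory after a few hours.

This is what I get when I try to estimate the number of rows in the table: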

SELECT reltuples FROM pg_class WHERE relname = 'orders';
  reltuples  
-------------
 4.38543e+08
(1 row)

score 1 · Accepted Answer

Two solutions come to mind immediately:

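1). Create a new table as select * from the source table, with a WHERE clause that selects only the unique rows. Add indexes to match the source table, then rename the two tables in a transaction. Whether this works for you depends on several factors, including the amount of free disk space, whether the table is in constant use, and whether an interruption of access is acceptable. Creating a new table has the benefit of packing the data and indexes tightly, and because the non-unique rows are omitted, the new table will be smaller than the original.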

2). Create a partial unique index on the columns, with a WHERE clause that filters out the non-unique rows. For example:

test=# create table t ( col1 int, col2 int, is_unique boolean);
CREATE TABLE

test=# insert into t values (1,2,true), (2,3,true),(2,3,false);
INSERT 0 3

test=# create unique index concurrently t_col1_col2_uidx on t (col1, col2) where is_unique is true;
CREATE INDEX

test=# \d t
        Table "public.t"
  Column   |  Type   | Modifiers 
-----------+---------+-----------
 col1      | integer | 
 col2      | integer | 
 is_unique | boolean | 
Indexes:
    "t_col1_col2_uidx" UNIQUE, btree (col1, col2) WHERE is_unique IS TRUE
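
Option 1 might look roughly like the sketch below. It is only an illustration under assumptions that are not in the original post: that the table is the orders table from the question, that (location_id, datetime) is the key that should be unique, that it does not matter which of the duplicate rows survives, and that nothing writes to orders while the copy runs. It uses SELECT DISTINCT ON instead of a WHERE clause to pick one row per key. The names orders_dedup, orders_old and the index name are made up for the example.

```sql
-- Assumed names (hypothetical): orders, location_id, datetime,
-- orders_dedup, orders_old, orders_location_id_datetime_uidx.

-- Copy one arbitrary row per (location_id, datetime) into a new table.
-- CREATE TABLE AS copies only the data: indexes, constraints and
-- column defaults of orders are not carried over.
CREATE TABLE orders_dedup AS
SELECT DISTINCT ON (location_id, datetime) *
FROM orders
ORDER BY location_id, datetime;

-- Enforce uniqueness from now on. Any other indexes that exist on
-- orders would have to be recreated on orders_dedup here as well.
CREATE UNIQUE INDEX orders_location_id_datetime_uidx
    ON orders_dedup (location_id, datetime);

-- Swap the two tables atomically.
BEGIN;
ALTER TABLE orders RENAME TO orders_old;
ALTER TABLE orders_dedup RENAME TO orders;
COMMIT;

-- Once the new table has been checked, the old one can be dropped:
-- DROP TABLE orders_old;
```

Rows written to orders after the copy starts would be missing from the new table, so writes would need to be paused or replayed afterwards. The old and new tables exist side by side until orders_old is dropped, which is why free disk space matters for this approach.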
