I have a case where orphaned records need to be cleaned up periodically, so I'm looking for a high-performance solution. I tried using an "IN" clause, but it is not very fast. The columns in both tables have all the required indexes (id - primary key, component_id - indexed, component_type - indexed).
DELETE FROM component_apportionment
WHERE id IN (
    SELECT a.id
    FROM component_apportionment a
    LEFT JOIN component_live c
           ON c.component_id   = a.component_id
          AND c.component_type = a.component_type
    WHERE c.id IS NULL
);
Basically, the task is to delete records from the "component_apportionment" table that do not exist in the "component_live" table.
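For clarity, the same intent written with NOT EXISTS over both join columns would look like this (just a sketch, reusing the table and column names from the query above):

DELETE FROM component_apportionment a
WHERE NOT EXISTS (
    SELECT 1
    FROM component_live c
    WHERE c.component_id   = a.component_id
      AND c.component_type = a.component_type
);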
The query plan for the IN query above is also bad:
Delete on component_apportionment_copy1 (cost=3860927.55..3860929.09 rows=1 width=18) (actual time=183479.848..183479.848 rows=0 loops=1)
-> Nested Loop (cost=3860927.55..3860929.09 rows=1 width=18) (actual time=183479.811..183479.813 rows=1 loops=1)
-> HashAggregate (cost=3860927.12..3860927.13 rows=1 width=20) (actual time=183479.793..183479.793 rows=1 loops=1)
Group Key: a.id
-> Merge Right Join (cost=3753552.72..3860927.12 rows=1 width=20) (actual time=172941.125..183479.787 rows=1 loops=1)
Merge Cond: ((c.component_id = a.component_id) AND ((c.component_type)::text = (a.component_type)::text))
Filter: (c.id IS NULL)
Rows Removed by Filter: 5968195
-> Sort (cost=3390767.32..3413658.29 rows=9156391 width=21) (actual time=169852.438..172642.897 rows=8043013 loops=1)
Sort Key: c.component_id, c.component_type
Sort Method: external merge Disk: 310232kB
-> Seq Scan on component_live c (cost=0.00..2117393.91 rows=9156391 width=21) (actual time=0.004..155656.568 rows=9333382 loops=1)
-> Materialize (cost=362785.40..375049.75 rows=2452871 width=21) (actual time=3088.653..5343.013 rows=5968195 loops=1)
-> Sort (cost=362785.40..368917.58 rows=2452871 width=21) (actual time=3088.648..3989.163 rows=2452871 loops=1)
Sort Key: a.component_id, a.component_type
Sort Method: external merge Disk: 81504kB
-> Seq Scan on component_apportionment_copy1 a (cost=0.00..44969.71 rows=2452871 width=21) (actual time=0.920..882.040 rows=2452871 loops=1)
-> Index Scan using component_apportionment_copy1_pkey on component_apportionment_copy1 (cost=0.43..1.95 rows=1 width=14) (actual time=0.012..0.012 rows=1 loops=1)
Index Cond: (id = a.id)
Planning time: 5.573 ms
Execution time: 183554.675 ms
Any help would be appreciated. Thanks.
Note
In the worst case each table will have around 80 million records. Both tables have indexes on the columns used.
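For reference, here is a minimal sketch of the schema and indexes as I understand them (the column types are assumed; the real tables have more columns):

CREATE TABLE component_live (
    id             bigserial PRIMARY KEY,
    component_id   bigint NOT NULL,
    component_type varchar(100) NOT NULL  -- type assumed; the plan casts it to text
    -- other columns omitted
);
CREATE INDEX ON component_live (component_id);
CREATE INDEX ON component_live (component_type);

CREATE TABLE component_apportionment (
    id             bigserial PRIMARY KEY,
    component_id   bigint NOT NULL,
    component_type varchar(100) NOT NULL
    -- other columns omitted
);
CREATE INDEX ON component_apportionment (component_id);
CREATE INDEX ON component_apportionment (component_type);

A composite index on (component_id, component_type) is something I could add on either table if it would help.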
Update
Query plan for the "NOT EXISTS" version.
Query:
EXPLAIN (analyze, verbose, buffers)
DELETE FROM component_apportionment_copy1
WHERE NOT EXISTS (
    SELECT 1
    FROM component_live c
    WHERE c.component_id = component_apportionment_copy1.component_id
);
Delete on vector.component_apportionment_copy1 (cost=2276557.80..2446287.39 rows=2104532 width=12) (actual time=203643.560..203643.560 rows=0 loops=1)
Buffers: shared hit=20875 read=2025400, temp read=46067 written=45813
-> Hash Anti Join (cost=2276557.80..2446287.39 rows=2104532 width=12) (actual time=202212.975..203643.486 rows=1 loops=1)
Output: component_apportionment_copy1.ctid, c.ctid
Hash Cond: (component_apportionment_copy1.component_id = c.component_id)
Buffers: shared hit=20874 read=2025400, temp read=46067 written=45813
-> Seq Scan on vector.component_apportionment_copy1 (cost=0.00..44969.71 rows=2452871 width=10) (actual time=0.003..659.668 rows=2452871 loops=1)
Output: component_apportionment_copy1.ctid, component_apportionment_copy1.component_id
Buffers: shared hit=20441
-> Hash (cost=2117393.91..2117393.91 rows=9156391 width=10) (actual time=198536.786..198536.786 rows=9333382 loops=1)
Output: c.ctid, c.component_id
Buckets: 16384 Batches: 128 Memory Usage: 3195kB
Buffers: shared hit=430 read=2025400, temp written=36115
-> Seq Scan on vector.component_live c (cost=0.00..2117393.91 rows=9156391 width=10) (actual time=0.039..194415.641 rows=9333382 loops=1)
Output: c.ctid, c.component_id
Buffers: shared hit=430 read=2025400
Planning time: 6.639 ms
Execution time: 203643.594 ms
It does a seq scan on both tables, and the more data there is, the slower it will be.
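One more observation: both plans spill to disk (the external merge sorts in the first plan, and 128 hash batches with only ~3 MB of memory usage in the second), which suggests work_mem is quite low for this workload. If it is relevant, this is roughly how I would retest with a larger per-session work_mem (the value is only illustrative):

SET work_mem = '512MB';  -- session-level only; illustrative value

EXPLAIN (analyze, verbose, buffers)
DELETE FROM component_apportionment_copy1
WHERE NOT EXISTS (
    SELECT 1
    FROM component_live c
    WHERE c.component_id = component_apportionment_copy1.component_id
);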