postgresql - Select records whose ID is contained in another table

Question

vit=# select count(*) from evtags;
  count  
---------
 4496914

vit=# explain select tag from evtags where evid in (1002, 1023);
                              QUERY PLAN                                    
---------------------------------------------------------------------------------
 Index Only Scan using evtags_pkey on evtags  (cost=0.00..15.64 rows=12 width=7)
   Index Cond: (evid = ANY ('{1002,1023}'::integer[]))

So far this looks perfectly fine. Next, I want to take the IDs from another table instead of specifying them in the query.

vit=# select count(*) from zzz;
 count 
-------
 49738

Here we go...

vit=# explain select tag from evtags where evid in (select evid from zzz);
                              QUERY PLAN                               
-----------------------------------------------------------------------
 Hash Semi Join  (cost=1535.11..142452.47 rows=291712 width=7)
   Hash Cond: (evtags.evid = zzz.evid)
   ->  Seq Scan on evtags  (cost=0.00..69283.14 rows=4496914 width=11)
   ->  Hash  (cost=718.38..718.38 rows=49738 width=4)
         ->  Seq Scan on zzz  (cost=0.00..718.38 rows=49738 width=4)

Why does it do a sequential scan on the larger table, and what is the right way to do this?

Edit

I recreated my zzz table, and now for some reason the plan is better:

vit=# explain analyze select tag from evtags where evid in (select evid from zzz);
                                                         QUERY PLAN                                                             
------------------------------------------------------------------------------------------------------------------------------------
 Nested Loop  (cost=708.00..2699.17 rows=2248457 width=7) (actual time=28.935..805.923 rows=244353 loops=1)
   ->  HashAggregate  (cost=708.00..710.00 rows=200 width=4) (actual time=28.893..54.461 rows=38822 loops=1)
         ->  Seq Scan on zzz  (cost=0.00..601.80 rows=42480 width=4) (actual time=0.032..10.985 rows=40000 loops=1)
   ->  Index Only Scan using evtags_pkey on evtags  (cost=0.00..9.89 rows=6 width=11) (actual time=0.015..0.017 rows=6 loops=38822)
         Index Cond: (evid = zzz.evid)
         Heap Fetches: 0
 Total runtime: 825.651 ms

But after a few executions it changes to

vit=# explain analyze select tag from evtags where evid in (select evid from zzz);
                                                                    QUERY PLAN                                                                     
---------------------------------------------------------------------------------------------------------------------------------------------------
 Merge Semi Join  (cost=4184.11..127258.48 rows=235512 width=7) (actual time=38.269..1461.755 rows=244353 loops=1)
   Merge Cond: (evtags.evid = zzz.evid)
   ->  Index Only Scan using evtags_pkey on evtags  (cost=0.00..136736.89 rows=4496914 width=11) (actual time=0.038..899.647 rows=3630070 loops=1)
         Heap Fetches: 0
   ->  Materialize  (cost=4184.04..4384.04 rows=40000 width=4) (actual time=38.212..61.038 rows=40000 loops=1)
         ->  Sort  (cost=4184.04..4284.04 rows=40000 width=4) (actual time=38.208..51.104 rows=40000 loops=1)
               Sort Key: zzz.evid
               Sort Method: external sort  Disk: 552kB
               ->  Seq Scan on zzz  (cost=0.00..577.00 rows=40000 width=4) (actual time=0.018..8.833 rows=40000 loops=1)
 Total runtime: 1484.293 ms

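...which is actually slower. Is there a way to hint the planner toward the "right" execution plan?

The point of all this is that I want to run several queries against a subset of my data, and I want to use a separate temporary table to hold the IDs of the records I want to process.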

score 1 · Accepted Answer

An inner join is more likely to produce a good plan:

select e.tag
from
    evtags e
    inner join
    zzz z using (evid)

Or this:

select e.tag
from evtags e
where exists (
    select 1
    from zzz
    where evid = e.evid
)
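
One caveat when choosing between the two forms: a plain join returns one output row per matching pair, so duplicate evid values in zzz would duplicate tags in the result, whereas IN and EXISTS return each evtags row at most once. The Nested Loop plan in the question (40000 rows read from zzz, 38822 left after the HashAggregate) suggests zzz does contain duplicates. A minimal sketch of a join that keeps the IN semantics, assuming the same table and column names as in the question:

```sql
-- Deduplicate the ID list first so the join cannot multiply rows.
select e.tag
from
    evtags e
    inner join
    (select distinct evid from zzz) z using (evid);
```

If zzz is built so that evid is already unique, the subquery is unnecessary and the plain join gives the same rows.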

As pointed out in the comments, also run analyze evtags; analyze zzz; so the planner has up-to-date statistics.
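
Collecting statistics matters especially for the temporary-table workflow described in the question, because autovacuum does not process temporary tables and they stay without statistics until ANALYZE is run by hand. The rows=200 estimate on the HashAggregate in the question looks like the planner's default guess for a table without statistics, though that is an inference from the plan, not something the asker confirmed. The following is only a sketch of the workflow: the filter in the WHERE clause is hypothetical, and it assumes no table named zzz exists yet in the session.

```sql
-- Build the ID list once per session.
-- The WHERE clause is a hypothetical filter, used only for illustration.
create temporary table zzz as
select distinct evid
from evtags
where tag like 'a%';

-- Autovacuum skips temporary tables, so gather statistics manually.
analyze zzz;

-- Check which plan the planner chooses now.
explain analyze
select e.tag
from evtags e
where exists (select 1 from zzz z where z.evid = e.evid);

-- Diagnostic only: disable one join strategy for this session to compare
-- timings against the plan above, then restore the default.
set enable_mergejoin = off;
explain analyze
select e.tag
from evtags e
where exists (select 1 from zzz z where z.evid = e.evid);
reset enable_mergejoin;
```

PostgreSQL has no per-query plan hints in core, so the enable_* settings are best treated as a way to investigate a plan choice, not as a permanent fix; accurate statistics are usually the better lever.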
