postgresql - When does PostgreSQL collapse subqueries into joins, and when not?

Question

Consider the following query:

select a.id from a
where
    a.id in (select b.a_id from b where b.x='x1' and b.y='y1') and
    a.id in (select b.a_id from b where b.x='x2' and b.y='y2')
order by a.date desc
limit 20

which should be rewritable to the faster:

select a.id from a inner join b as b1 on (a.id=b1.a_id) inner join b as b2 on (a.id=b2.a_id)
where
    b1.x='x1' and b1.y='y1' and
    b2.x='x2' and b2.y='y2'
order by a.date desc
limit 20

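We would prefer not to rewrite our queries by changing our source code, because that is complicated (especially when using Django).

So we would like to know: when does PostgreSQL collapse subqueries into joins, and when does it not?

This is the simplified data model: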

                                      Table "public.a"
      Column       |          Type          |                          Modifiers
-------------------+------------------------+-------------------------------------------------------------
 id                | integer                | not null default nextval('a_id_seq'::regclass)
 date              | date                   | 
 content           | character varying(256) | 
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
    "a_id_date" btree (id, date)
Referenced by:
    TABLE "b" CONSTRAINT "a_id_refs_id_6e634433343d4435353" FOREIGN KEY (a_id) REFERENCES a(id) DEFERRABLE INITIALLY DEFERRED


       Table "public.b"
  Column  |   Type    | Modifiers 
----------+-----------+-----------
 a_id     | integer   | not null
 x        | text      | not null
 y        | text      | not null
Indexes:
    "b_x_y_a_id" UNIQUE CONSTRAINT, btree (x, y, a_id)
Foreign-key constraints:
    "a_id_refs_id_6e634433343d4435353" FOREIGN KEY (a_id) REFERENCES a(id) DEFERRABLE INITIALLY DEFERRED

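a has 7 million rows
b has 70 million rows
cardinality of b.x = ~100
cardinality of b.y = ~100000
cardinality of b.x, b.y = ~150000
Imagine tables c, d and e with the same structure as b, which can additionally be used to further reduce the resulting a.ids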

The PostgreSQL versions we tested the queries on:

PostgreSQL 9.2.7 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012], 64-bit
PostgreSQL 9.4beta1 on x86_64-suse-linux-gnu, compiled by gcc (SUSE Linux) 4.7.2 20130108 [gcc-4_7-branch revision 195012], 64-bit

Query plans (with empty file cache and memory cache):

score 1 · Accepted Answer

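I think your last comment points to the reason: the two queries are not equivalent unless a unique constraint kicks in to make them equivalent.

Example of an equivalent schema: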

denis=# \d a
                         Table "public.a"
 Column |  Type   |                   Modifiers                    
--------+---------+------------------------------------------------
 id     | integer | not null default nextval('a_id_seq'::regclass)
 d      | date    | not null
Indexes:
    "a_pkey" PRIMARY KEY, btree (id)
Referenced by:
    TABLE "b" CONSTRAINT "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

denis=# \d b
       Table "public.b"
 Column |  Type   | Modifiers 
--------+---------+-----------
 a_id   | integer | not null
 val    | integer | not null
Foreign-key constraints:
    "b_a_id_fkey" FOREIGN KEY (a_id) REFERENCES a(id)

Equivalent offending data using that schema:

denis=# select * from a order by d;
 id |     d      
----+------------
  1 | 2014-12-10
  2 | 2014-12-11
  3 | 2014-12-12
  4 | 2014-12-13
  5 | 2014-12-14
  6 | 2014-12-15
(6 rows)

denis=# select * from b order by a_id, val;
 a_id | val 
------+-----
    1 |   1
    1 |   1
    2 |   1
    2 |   1
    2 |   2
    3 |   1
    3 |   1
    3 |   2
(8 rows)

Rows using two IN clauses:

denis=# select a.id, a.d from a where a.id in (select b.a_id from b where b.val = 1) and a.id in (select b.a_id from b where b.val = 2) order by d;
 id |     d      
----+------------
  2 | 2014-12-11
  3 | 2014-12-12
(2 rows)

Rows using two joins:

denis=# select a.id, a.d from a join b b1 on a.id = b1.a_id join b b2 on a.id = b2.a_id where b1.val = 1 and b2.val = 2 order by d;
 id |     d      
----+------------
  2 | 2014-12-11
  2 | 2014-12-11
  3 | 2014-12-12
  3 | 2014-12-12
(4 rows)

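I see you already have a unique constraint on b (a_id, x, y), though. Perhaps raise the issue on the Postgres performance list to find out why the subqueries are not collapsed in your particular case, or at least why the two forms do not generate exactly the same plan.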
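The role of the unique constraint can be checked on the toy data above. The following is a minimal sketch and not part of the original answer: it uses Python's built-in sqlite3 module as a stand-in for PostgreSQL (an assumption made only so the example runs without a server), and the helper `run` and the index name `b_a_id_val` are hypothetical. It illustrates row-count semantics only and says nothing about what the PostgreSQL planner actually does.

```python
import sqlite3

IN_FORM = """
    SELECT a.id FROM a
    WHERE a.id IN (SELECT b.a_id FROM b WHERE b.val = 1)
      AND a.id IN (SELECT b.a_id FROM b WHERE b.val = 2)
    ORDER BY a.d
"""

JOIN_FORM = """
    SELECT a.id FROM a
    JOIN b b1 ON a.id = b1.a_id
    JOIN b b2 ON a.id = b2.a_id
    WHERE b1.val = 1 AND b2.val = 2
    ORDER BY a.d
"""


def run(unique_b):
    """Return (rows of the IN form, rows of the JOIN form).

    With unique_b=True a unique index on b (a_id, val) is created and the
    duplicate rows are dropped from the sample data before loading.
    """
    con = sqlite3.connect(":memory:")
    con.execute("CREATE TABLE a (id INTEGER PRIMARY KEY, d TEXT NOT NULL)")
    con.execute(
        "CREATE TABLE b (a_id INTEGER NOT NULL REFERENCES a(id),"
        " val INTEGER NOT NULL)"
    )
    if unique_b:
        con.execute("CREATE UNIQUE INDEX b_a_id_val ON b (a_id, val)")

    # Same six rows of a as in the psql session (2014-12-10 .. 2014-12-15).
    con.executemany(
        "INSERT INTO a VALUES (?, ?)",
        [(i, "2014-12-%02d" % (9 + i)) for i in range(1, 7)],
    )

    # Same eight rows of b, including the duplicates.
    b_rows = [(1, 1), (1, 1), (2, 1), (2, 1), (2, 2), (3, 1), (3, 1), (3, 2)]
    if unique_b:
        b_rows = sorted(set(b_rows))
    con.executemany("INSERT INTO b VALUES (?, ?)", b_rows)

    in_rows = con.execute(IN_FORM).fetchall()
    join_rows = con.execute(JOIN_FORM).fetchall()
    con.close()
    return in_rows, join_rows


if __name__ == "__main__":
    for unique_b in (False, True):
        in_rows, join_rows = run(unique_b)
        print("unique index:", unique_b, "IN:", in_rows, "JOIN:", join_rows)
```

Without the unique index the join form repeats ids 2 and 3 once per duplicate row in b; with it, both forms return the same rows, which is the condition under which a planner could treat them as interchangeable.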

score 0 · Accepted Answer

        -- The table definitions
CREATE TABLE table_a (
        id     SERIAL NOT NULL PRIMARY KEY
        , d      DATE NOT NULL
        );

CREATE TABLE table_b (
        id     SERIAL NOT NULL PRIMARY KEY
        , a_id INTEGER NOT NULL REFERENCES table_a(id)
        , x VARCHAR NOT NULL
        , y VARCHAR NOT NULL
        );
        -- fake some data
INSERT INTO table_a(d)
SELECT gs
FROM generate_series( '1904-01-01'::timestamp ,'2015-01-01'::timestamp, '1 day'::interval) gs;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x1' , 'y1' FROM table_a a;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x2' , 'y2' FROM table_a a;
INSERT INTO table_b(a_id, x, y) SELECT a.id, 'x3' , 'y3' FROM table_a a;
DELETE FROM table_b WHERE RANDOM() > 0.3;

CREATE UNIQUE INDEX ON table_a(d, id);  -- date first
CREATE INDEX ON table_b(a_id);          -- supporting the FK

        -- For initialising the statistics
VACUUM ANALYZE table_a;
VACUUM ANALYZE table_b;

        -- original query
EXPLAIN ANALYZE
SELECT a.id
FROM table_a a
WHERE a.id IN (SELECT b.a_id FROM table_b b WHERE b.x='x1' AND b.y='y1')
  AND a.id IN (SELECT b.a_id FROM table_b b WHERE b.x='x2' AND b.y='y2')
order by a.d desc
limit 20;

        -- EXISTS() version
EXPLAIN ANALYZE
SELECT a.id
FROM table_a a
WHERE EXISTS (SELECT * FROM table_b b WHERE b.a_id= a.id AND b.x='x1' AND b.y='y1')
  AND EXISTS (SELECT * FROM table_b b WHERE b.a_id= a.id AND b.x='x2' AND b.y='y2')
order by a.d desc
limit 20;

Resulting query plan:

 Limit  (cost=0.87..491.23 rows=20 width=8) (actual time=0.080..0.521 rows=20 loops=1)
   ->  Nested Loop Semi Join  (cost=0.87..15741.40 rows=642 width=8) (actual time=0.080..0.518 rows=20 loops=1)
         ->  Nested Loop Semi Join  (cost=0.58..14380.54 rows=4043 width=12) (actual time=0.017..0.391 rows=74 loops=1)
               ->  Index Only Scan Backward using table_a_d_id_idx on table_a a  (cost=0.29..732.75 rows=40544 width=8) (actual time=0.008..0.048 rows=231 loops=1)
                     Heap Fetches: 0
               ->  Index Scan using table_b_a_id_idx on table_b b_1  (cost=0.29..0.34 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=231)
                     Index Cond: (a_id = a.id)
                     Filter: (((x)::text = 'x2'::text) AND ((y)::text = 'y2'::text))
                     Rows Removed by Filter: 0
         ->  Index Scan using table_b_a_id_idx on table_b b  (cost=0.29..0.34 rows=1 width=4) (actual time=0.001..0.001 rows=0 loops=74)
               Index Cond: (a_id = a.id)
               Filter: (((x)::text = 'x1'::text) AND ((y)::text = 'y1'::text))
               Rows Removed by Filter: 1
 Total runtime: 0.547 ms

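Both queries result in exactly the same query plan and results (because of the NOT NULL on table_b.a_id)
The index on table_b(a_id) is absolutely necessary once you prefer indexed joins over hash joins (for 7M/70M tuples, I think you should prefer index scans)
The (expensive) sort in the outer query is avoided by using the index on table_a(d, id)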
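The claim that the IN and EXISTS forms return the same rows can be tried on a small data set. This is a hedged sketch rather than the answer's own test: it runs on SQLite through Python's sqlite3 module instead of PostgreSQL, the function `compare` is hypothetical, and the random DELETE from the script above is replaced by a deterministic rule (an assumption, so that the result is reproducible). It checks result equality only; query plans on SQLite are not comparable to the PostgreSQL plan shown above.

```python
import sqlite3
from datetime import date, timedelta

IN_QUERY = """
    SELECT a.id FROM table_a a
    WHERE a.id IN (SELECT b.a_id FROM table_b b WHERE b.x = 'x1' AND b.y = 'y1')
      AND a.id IN (SELECT b.a_id FROM table_b b WHERE b.x = 'x2' AND b.y = 'y2')
    ORDER BY a.d DESC
    LIMIT 20
"""

EXISTS_QUERY = """
    SELECT a.id FROM table_a a
    WHERE EXISTS (SELECT * FROM table_b b
                  WHERE b.a_id = a.id AND b.x = 'x1' AND b.y = 'y1')
      AND EXISTS (SELECT * FROM table_b b
                  WHERE b.a_id = a.id AND b.x = 'x2' AND b.y = 'y2')
    ORDER BY a.d DESC
    LIMIT 20
"""


def compare():
    """Return (ids from the IN form, ids from the EXISTS form)."""
    con = sqlite3.connect(":memory:")
    con.executescript("""
        CREATE TABLE table_a (id INTEGER PRIMARY KEY, d TEXT NOT NULL);
        CREATE TABLE table_b (
            id   INTEGER PRIMARY KEY,
            a_id INTEGER NOT NULL REFERENCES table_a(id),
            x    TEXT NOT NULL,
            y    TEXT NOT NULL
        );
        CREATE UNIQUE INDEX table_a_d_id_idx ON table_a (d, id);
        CREATE INDEX table_b_a_id_idx ON table_b (a_id);
    """)

    # 100 consecutive days; id grows together with the date.
    start = date(2014, 1, 1)
    con.executemany(
        "INSERT INTO table_a (id, d) VALUES (?, ?)",
        [(i, (start + timedelta(days=i)).isoformat()) for i in range(1, 101)],
    )

    # Deterministic thinning instead of RANDOM(): keep ('x1','y1') for even
    # ids, ('x2','y2') for multiples of 3, ('x3','y3') for multiples of 5.
    rules = [("x1", "y1", 2), ("x2", "y2", 3), ("x3", "y3", 5)]
    b_rows = [
        (a_id, x, y)
        for x, y, step in rules
        for a_id in range(1, 101)
        if a_id % step == 0
    ]
    con.executemany("INSERT INTO table_b (a_id, x, y) VALUES (?, ?, ?)", b_rows)

    in_ids = [row[0] for row in con.execute(IN_QUERY)]
    exists_ids = [row[0] for row in con.execute(EXISTS_QUERY)]
    con.close()
    return in_ids, exists_ids


if __name__ == "__main__":
    in_ids, exists_ids = compare()
    print("same result:", in_ids == exists_ids)
    print(in_ids)
```

On this data only the multiples of 6 have both an ('x1', 'y1') and an ('x2', 'y2') row, so both forms return those ids, newest date first.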
