database - PostgreSQL query optimization, no inner/outer joins allowed

Question

I was given this query to optimize on PostgreSQL 9.2:

SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON(C.id = I.category) 
                  INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name

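as part of a school assignment.

I created these indexes on the respective tables: items(category) --> secondary B+tree, bids(item_id) --> secondary B+tree, and categories(id) --> primary index.

Oddly, PostgreSQL does sequential scans on my Items, Categories, and Bids tables, and when I set enable_seqscan=off, the index-based plan performs even worse than the result below.

When I run EXPLAIN in PostgreSQL, this is the result (please don't remove the indentation, it matters):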

GroupAggregate  (cost=119575.55..125576.11 rows=20 width=23) (actual time=6912.523..9459.431 rows=20 loops=1)
  Buffers: shared hit=30 read=12306, temp read=6600 written=6598
  ->  Sort  (cost=119575.55..121075.64 rows=600036 width=23) (actual time=6817.015..8031.285 rows=600036 loops=1)
        Sort Key: c.name
        Sort Method: external merge  Disk: 20160kB
        Buffers: shared hit=30 read=12306, temp read=6274 written=6272
        ->  Hash Join  (cost=9416.95..37376.03 rows=600036 width=23) (actual time=407.974..3322.253 rows=600036 loops=1)
              Hash Cond: (b.item_id = i.id)
              Buffers: shared hit=30 read=12306, temp read=994 written=992
              ->  Seq Scan on bids b  (cost=0.00..11001.36 rows=600036 width=8) (actual time=0.009..870.898 rows=600036 loops=1)
                    Buffers: shared hit=2 read=4999
              ->  Hash  (cost=8522.95..8522.95 rows=50000 width=19) (actual time=407.784..407.784 rows=50000 loops=1)
                    Buckets: 4096  Batches: 2  Memory Usage: 989kB
                    Buffers: shared hit=28 read=7307, temp written=111
                    ->  Hash Join  (cost=1.45..8522.95 rows=50000 width=19) (actual time=0.082..313.211 rows=50000 loops=1)
                          Hash Cond: (i.category = c.id)
                          Buffers: shared hit=28 read=7307
                          ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.004..144.554 rows=50000 loops=1)
                                Buffers: shared hit=27 read=7307
                          ->  Hash  (cost=1.20..1.20 rows=20 width=19) (actual time=0.062..0.062 rows=20 loops=1)
                                Buckets: 1024  Batches: 1  Memory Usage: 1kB
                                Buffers: shared hit=1
                                ->  Seq Scan on categories c  (cost=0.00..1.20 rows=20 width=19) (actual time=0.004..0.028 rows=20 loops=1)
                                      Buffers: shared hit=1
Total runtime: 9473.257 ms

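See this plan on explain.depesz.com.

I just want to know why this happens, i.e. why the indexes make the query so much slower than the sequential scans.

Edit: I think I've managed to work something out by reading the PostgreSQL documentation. PostgreSQL decides to do a seq scan on some tables (e.g. bids and items) because it predicts it has to retrieve every single row in the table (compare the row count in the parentheses before the actual time with the row count in the actual time part). A sequential scan is the better way to retrieve all rows. Well, nothing can be done about that part.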
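The point in the edit can be seen by comparing a selective query with one that needs every row. This is only a sketch: the item id `42` is a made-up value, and the plan the planner actually picks depends on table statistics and configuration.

```sql
-- Selective predicate: only the bids for one item are needed,
-- so an index scan on bids(item_id) is a likely choice.
EXPLAIN SELECT * FROM Bids WHERE item_id = 42;  -- 42 is a hypothetical id

-- No predicate: every row has to be read anyway,
-- so a sequential scan is the expected choice.
EXPLAIN SELECT * FROM Bids;
```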

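I created an additional index on categories(name), and the result below is what I get. It has improved somewhat, but now the hash joins have been replaced by nested loops. Any clue why?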

GroupAggregate  (cost=0.00..119552.02 rows=20 width=23) (actual time=617.330..7725.314 rows=20 loops=1)
  Buffers: shared hit=178582 read=37473 written=14, temp read=2435 written=436
  ->  Nested Loop  (cost=0.00..115051.55 rows=600036 width=23) (actual time=0.120..6186.496 rows=600036 loops=1)
        Buffers: shared hit=178582 read=37473 written=14, temp read=2109 written=110
        ->  Nested Loop  (cost=0.00..26891.55 rows=50000 width=19) (actual time=0.066..2827.955 rows=50000 loops=1)
              Join Filter: (c.id = i.category)
              Rows Removed by Join Filter: 950000
              Buffers: shared hit=2 read=7334 written=1, temp read=2109 written=110
              ->  Index Scan using categories_name_idx on categories c  (cost=0.00..12.55 rows=20 width=19) (actual time=0.039..0.146 rows=20 loops=1)
                    Buffers: shared hit=1 read=1
              ->  Materialize  (cost=0.00..8280.00 rows=50000 width=8) (actual time=0.014..76.908 rows=50000 loops=20)
                    Buffers: shared hit=1 read=7333 written=1, temp read=2109 written=110
                    ->  Seq Scan on items i  (cost=0.00..7834.00 rows=50000 width=8) (actual time=0.007..170.464 rows=50000 loops=1)
                          Buffers: shared hit=1 read=7333 written=1
        ->  Index Scan using bid_itemid_idx on bids b  (cost=0.00..1.60 rows=16 width=8) (actual time=0.016..0.036 rows=12 loops=50000)
              Index Cond: (item_id = i.id)
              Buffers: shared hit=178580 read=30139 written=13
Total runtime: 7726.392 ms

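Have a look at the plan here to see whether it is better.

I managed to bring the cost down by creating indexes on category(id) and items(category). PostgreSQL used both indexes to get to a cost of 114062.92. But now PostgreSQL is playing games with me and not using the indexes! Why is it so buggy?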

score 1 · Accepted Answer

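Thanks for posting EXPLAIN output without being asked, and for using EXPLAIN (BUFFERS, ANALYZE).

A significant part of your query's performance problem is likely the outer sort plan node, which is doing an on-disk merge sort with a temporary file: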

Sort Method: external merge Disk: 20160kB

You could do this sort in memory by setting:

SET work_mem = '50MB';

before running your query. This setting can also be set per user, per database, or globally in postgresql.conf.
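The narrower scopes could look roughly like this. The role name `app_user` and the database name `auction` are hypothetical placeholders, and the 50MB value is simply carried over from the example above.

```sql
-- Current session only
SET work_mem = '50MB';

-- Current transaction only (reverts at COMMIT or ROLLBACK)
BEGIN;
SET LOCAL work_mem = '50MB';
-- ... run the query here ...
COMMIT;

-- Default for one role (hypothetical name); applies to new sessions
ALTER ROLE app_user SET work_mem = '50MB';

-- Default for one database (hypothetical name); applies to new sessions
ALTER DATABASE auction SET work_mem = '50MB';
```

Keep in mind that work_mem is a per-operation limit (each sort or hash in each session can use that much), so a narrow scope is usually the safer place for a large value.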

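As the query is currently structured, I'm not convinced that adding indexes will be of much benefit. It needs to read and join all rows from all three tables, and hash joins are likely the fastest way to do so.

I suspect there are other ways to express that query that would use entirely different and more efficient execution strategies, but I'm having a brain-fade as to what they might be and don't want to spend the time making up dummy tables to play around with. More work_mem should significantly improve the query as it stands.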
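One way to check whether the extra memory helped is to re-run the plan in the same session and look at the Sort node. This is a sketch using the query from the question; whether 50MB is actually enough for this data set is an assumption.

```sql
SET work_mem = '50MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT C.name, COUNT(DISTINCT I.id) AS NumItems, COUNT(B.id)
FROM Categories C INNER JOIN Items I ON (C.id = I.category)
                  INNER JOIN Bids B ON (I.id = B.item_id)
GROUP BY C.name;
-- In the Sort node, "Sort Method: quicksort  Memory: ..." means the sort
-- ran in memory; "external merge  Disk: ..." means it still spilled.

RESET work_mem;  -- back to the configured default
```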

score 0 · Accepted Answer

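From the query plan we can see:
1. The result and Categories both have 20 records.
2. The rows that pass the category join filter are 5% of the rows examined on the items scan ("Rows Removed by Join Filter: 950000" vs. "rows=50000").
3. The matched bids come to rows=600036 (can you give us the total number of bids?).
4. Does every category have bids?

So we want to use the indexes on items(category) and bids(item_id). We also want the sort to fit in memory.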

 select  
   (select name from Categories where id = foo.category) as name, 
   count(foo.id),  
   sum(foo.bids_count)  
 from 
   (select 
      id,  
      category,  
      (select count(item_id) from Bids where item_id = i.id) as bids_count  
    from Items i  
    where category in (select id from Categories)  
      and exists (select 1 from Bids where item_id = i.id)  
   ) as foo  
  group by foo.category  
  order by name

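Of course, you have to remember that this depends strictly on the data in points 1 and 2.

If 4 is true, you can remove the EXISTS part from the query.

Any advice or ideas?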

score 0 · Accepted Answer

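Note that if bids is systematically and significantly larger than items, it may actually be cheaper to traverse items twice (especially if items fits in RAM) than to pick out the distinct item IDs from the join result (even if you sort in memory). Moreover, depending on how the Postgres pipeline happens to pull data from the duplicated tables, there may be only a limited penalty even under adverse load or memory conditions (this would be a good question to ask on pgsql-general). Take: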

SELECT name, IC.cnt, BC.cnt FROM
Categories C,
( SELECT category, count(1) cnt from Items I GROUP BY category ) IC,
( SELECT category, count(1) cnt from Bids B INNER JOIN Items I ON (I.id = B.item_id) GROUP BY category ) BC
WHERE IC.category=C.id AND BC.category=id;

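How much cheaper? At least 4x given sufficient caching, i.e. 610 ms vs. 2500 ms (with an in-memory sort), with 20 categories, 50k items, and 600k bids, and still more than 2x faster after flushing the filesystem cache on my machine.

Note that the above is not a drop-in replacement for your original query. For one, it assumes a 1:1 mapping between category IDs and names (which could be a very reasonable assumption; if it doesn't hold, simply use SUM(BC.cnt) and SUM(IC.cnt) as you GROUP BY name). More importantly, the per-category item count includes items that have no bids, unlike your original INNER JOIN. If only the count of items with bids is required, you can add WHERE EXISTS (SELECT 1 FROM Bids B where item_id=I.id) to the IC subquery; this traverses Bids a second time as well (in my case that added a ~200 ms penalty to the existing ~600 ms plan, still far below 2400 ms).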
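For illustration, the variant that counts only items with bids might look like the sketch below. It keeps the 1:1 assumption between category IDs and names, and additionally assumes that Items.id is unique and that Bids rows always have a non-NULL id (neither is stated in the question). The aliases NumItems and NumBids are illustrative.

```sql
SELECT C.name, IC.cnt AS NumItems, BC.cnt AS NumBids
FROM Categories C,
     -- items per category, counting only items that have at least one bid
     ( SELECT I.category, count(*) AS cnt
       FROM Items I
       WHERE EXISTS (SELECT 1 FROM Bids B WHERE B.item_id = I.id)
       GROUP BY I.category ) IC,
     -- bids per category
     ( SELECT I.category, count(*) AS cnt
       FROM Bids B INNER JOIN Items I ON (I.id = B.item_id)
       GROUP BY I.category ) BC
WHERE IC.category = C.id
  AND BC.category = C.id;
```

As in the original query, categories with no bids at all drop out of the result, because neither subquery produces a row for them.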
