I have a table on PostgreSQL 13.1 (more than 100 million rows):

CREATE TABLE report
(
    id     serial primary key,
    license_plate_id integer,
    datetime timestamp
);

Indexes (I created them for testing):

create index report_lp_datetime_index on report (license_plate_id, datetime);
create index report_lp_datetime_desc_index on report (license_plate_id desc, datetime desc);

So my question is why a query like

select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75)
order by datetime desc
limit 100

is so slow (~10 seconds), while the same query without the ORDER BY clause is fast (milliseconds).

EXPLAIN output:

explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34, 75,374,57123)
limit 100
Limit  (cost=0.57..400.38 rows=100 width=316) (actual time=0.037..0.216 rows=100 loops=1)
  Buffers: shared hit=103
  ->  Index Scan using report_lp_id_idx on report r  (cost=0.57..44986.97 rows=11252 width=316) (actual time=0.035..0.202 rows=100 loops=1)
        Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
        Buffers: shared hit=103
Planning Time: 0.228 ms
Execution Time: 0.251 ms


explain (analyze, buffers, format text) select * from report r
where r.license_plate_id in (1,2,4,5,6,7,8,10,15,22,34,75,374,57123)
order by datetime desc
limit 100
Limit  (cost=44193.63..44193.88 rows=100 width=316) (actual time=4921.030..4921.047 rows=100 loops=1)
  Buffers: shared hit=11455 read=671
  ->  Sort  (cost=44193.63..44221.76 rows=11252 width=316) (actual time=4921.028..4921.035 rows=100 loops=1)
        Sort Key: datetime DESC
        Sort Method: top-N heapsort  Memory: 128kB
        Buffers: shared hit=11455 read=671
        ->  Bitmap Heap Scan on report r  (cost=151.18..43763.59 rows=11252 width=316) (actual time=54.422..4911.927 rows=12148 loops=1)
              Recheck Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
              Heap Blocks: exact=12063
              Buffers: shared hit=11455 read=671
              ->  Bitmap Index Scan on report_lp_id_idx  (cost=0.00..148.37 rows=11252 width=0) (actual time=52.631..52.632 rows=12148 loops=1)
                    Index Cond: (license_plate_id = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75,374,57123}'::integer[]))
                    Buffers: shared hit=59 read=4
Planning Time: 0.427 ms
Execution Time: 4921.128 ms

2 Answers

If it takes several seconds to read 671 8kB blocks from disk, your storage appears to be rather slow.

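The way to speed this up is to physically reorder the table in the same order as the index, so that the rows you need are found in the same or adjacent table blocks: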

CLUSTER report USING report_lp_id_idx;

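Be aware that rewriting the table this way causes downtime: the table is unavailable while it is being rewritten. Also, PostgreSQL does not maintain the table order, so subsequent data modifications will gradually degrade performance, and after a while you will have to run CLUSTER again.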

But if you need this query to be fast no matter what, CLUSTER is the way to go.
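
A small follow-up sketch, not part of the original answer: CLUSTER does not update planner statistics by itself, so it is usually followed by ANALYZE. The correlation column of the standard pg_stats view gives a rough indication of how closely the physical row order still follows a column, which may help decide when to run CLUSTER again. The table and column names are the ones from the question.

```sql
-- Refresh planner statistics after the table has been rewritten.
ANALYZE report;

-- A correlation close to 1 or -1 means the physical row order closely
-- follows license_plate_id; values drifting toward 0 suggest the
-- clustering has decayed.
SELECT attname, correlation
FROM pg_stats
WHERE tablename = 'report'
  AND attname = 'license_plate_id';
```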

answered 2021-02-01T04:40:11.127

Your two indexes do the same thing, so you can drop the second one; it is useless.

To optimize your query, the order of the columns in the index must be reversed:

create index report_lp_datetime_index on report (datetime,license_plate_id);


BEGIN;
CREATE TABLE foo (d INTEGER, i INTEGER);
INSERT INTO foo SELECT random()*100000, random()*1000 FROM generate_series(1,1000000) s;
CREATE INDEX foo_d_i ON foo(d DESC,i);
COMMIT;
VACUUM ANALYZE foo;
EXPLAIN ANALYZE SELECT * FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100;

 Limit  (cost=0.42..343.92 rows=100 width=8) (actual time=0.076..9.359 rows=100 loops=1)
   ->  Index Only Scan Backward using foo_d_i on foo  (cost=0.42..40976.43 rows=11929 width=8) (actual time=0.075..9.339 rows=100 loops=1)
         Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
         Rows Removed by Filter: 9016
         Heap Fetches: 0
 Planning Time: 0.339 ms
 Execution Time: 9.387 ms

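Note that the index is not used to optimize the WHERE clause. It is used here as a compact and fast way to store references to the rows ordered by date DESC, so the ORDER BY can be served by an index-only scan and the sort is avoided. Because the id column from the WHERE clause (license_plate_id, here i) is also in the index, the index-only scan can test the condition on id without hitting the table for every row. Since the LIMIT is low, it does not need to scan the whole index: it only scans in date DESC order until it has found enough rows satisfying the WHERE condition to return the result.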

It will be a bit faster if you create the index in date DESC order, which is also useful if you use ORDER BY date DESC + LIMIT in other queries.

You forgot that the OP's table has a third column and that the query uses SELECT *. So this is not going to be an index-only scan.

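Easy to fix. The best way to run this query is an index-only scan that filters on the WHERE condition, then the LIMIT, and only then a visit to the table to fetch the rows. For some reason, when "select *" is used, postgres takes the id column from the table instead of from the index, which results in lots of unnecessary heap fetches for rows whose id is rejected by the WHERE condition.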

This is easily fixed by doing it manually. I have also added another bogus column to make sure that SELECT * hits the table.

EXPLAIN (ANALYZE,buffers) SELECT * FROM foo 
JOIN (SELECT d,i FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (d,i) 
ORDER BY d DESC LIMIT 100;

 Limit  (cost=0.85..1281.94 rows=1 width=17) (actual time=0.052..3.618 rows=100 loops=1)
   Buffers: shared hit=453
   ->  Nested Loop  (cost=0.85..1281.94 rows=1 width=17) (actual time=0.050..3.594 rows=100 loops=1)
         Buffers: shared hit=453
         ->  Limit  (cost=0.42..435.44 rows=100 width=8) (actual time=0.037..2.953 rows=100 loops=1)
               Buffers: shared hit=53
               ->  Index Only Scan using foo_d_i on foo foo_1  (cost=0.42..51936.43 rows=11939 width=8) (actual time=0.037..2.935 rows=100 loops=1)
                     Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
                     Rows Removed by Filter: 9010
                     Heap Fetches: 0
                     Buffers: shared hit=53
         ->  Index Scan using foo_d_i on foo  (cost=0.42..8.45 rows=1 width=17) (actual time=0.005..0.005 rows=1 loops=100)
               Index Cond: ((d = foo_1.d) AND (i = foo_1.i))
               Buffers: shared hit=400
 Execution Time: 3.663 ms

Another option is to add the primary key to the (date, license_plate) index.
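
The DDL behind this variant is not shown in the answer. A minimal sketch of what it might look like, assuming the test table had no id column yet: the names foo_pkey and foo_d_i_id are taken from the plan below, while the exact column order and sort direction of the index are a guess.

```sql
-- Assumption: add a surrogate key to the test table; PostgreSQL names the
-- resulting primary key constraint and index foo_pkey by default.
ALTER TABLE foo ADD COLUMN id serial PRIMARY KEY;

-- Assumption: same leading columns as foo_d_i, with id appended so the
-- inner subquery can return id from an index-only scan.
CREATE INDEX foo_d_i_id ON foo (d DESC, i, id);

-- Update statistics and the visibility map so index-only scans can skip
-- heap fetches.
VACUUM ANALYZE foo;
```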

SELECT * FROM foo JOIN (SELECT id FROM foo WHERE i IN (1,2,4,5,6,7,8,10,15,22,34,75) ORDER BY d DESC LIMIT 100) f USING (id) ORDER BY d DESC LIMIT 100;

 Limit  (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.920..3.947 rows=100 loops=1)
   Buffers: shared hit=473
   ->  Sort  (cost=1357.98..1358.23 rows=100 width=17) (actual time=3.919..3.931 rows=100 loops=1)
         Sort Key: foo.d DESC
         Sort Method: quicksort  Memory: 32kB
         Buffers: shared hit=473
         ->  Nested Loop  (cost=0.85..1354.66 rows=100 width=17) (actual time=0.055..3.858 rows=100 loops=1)
               Buffers: shared hit=473
               ->  Limit  (cost=0.42..509.41 rows=100 width=8) (actual time=0.039..3.116 rows=100 loops=1)
                     Buffers: shared hit=73
                     ->  Index Only Scan using foo_d_i_id on foo foo_1  (cost=0.42..60768.43 rows=11939 width=8) (actual time=0.039..3.093 rows=100 loops=1)
                           Filter: (i = ANY ('{1,2,4,5,6,7,8,10,15,22,34,75}'::integer[]))
                           Rows Removed by Filter: 9010
                           Heap Fetches: 0
                           Buffers: shared hit=73
               ->  Index Scan using foo_pkey on foo  (cost=0.42..8.44 rows=1 width=17) (actual time=0.006..0.006 rows=1 loops=100)
                     Index Cond: (id = foo_1.id)
                     Buffers: shared hit=400
Execution Time: 3.972 ms
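
Transposed to the report table from the question, the same pattern could look roughly like this. It is a sketch under assumptions rather than a tested solution: the index name report_datetime_lp_id_idx is hypothetical, and the timings above come from the small foo test table, not from the 100-million-row table.

```sql
-- Hypothetical index: datetime first so the scan follows ORDER BY,
-- then the filter column, then the primary key for the join back.
CREATE INDEX report_datetime_lp_id_idx
    ON report (datetime DESC, license_plate_id, id);

-- The inner query only needs indexed columns, so it can be an
-- index-only scan; the outer join fetches just the 100 winning rows.
SELECT r.*
FROM report r
JOIN (
    SELECT id
    FROM report
    WHERE license_plate_id IN (1,2,4,5,6,7,8,10,15,22,34,75)
    ORDER BY datetime DESC
    LIMIT 100
) f USING (id)
ORDER BY r.datetime DESC
LIMIT 100;
```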

Edit

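Thinking about it some more... Since the LIMIT restricts the output to 100 rows ordered by date DESC, wouldn't it be nice if we could fetch the 100 most recent rows for each license_plate_id, feed all of them into a top-N sort, and keep only the best 100 across all license_plate_ids? That would avoid reading and discarding a large number of rows from the index. Even though reading the index is much faster than hitting the table, it still loads those index pages into RAM and clogs the buffers with content that does not really need to be cached. Let's use a lateral join: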

EXPLAIN (ANALYZE,BUFFERS) 
SELECT * FROM foo 
  JOIN (SELECT d,i FROM 
    (VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75)) idlist 
    CROSS JOIN LATERAL 
    (SELECT d,i FROM foo WHERE i=idlist.column1 ORDER BY d DESC LIMIT 100) f2 
    ORDER BY d DESC LIMIT 100
  ) f3 USING (d,i)
  ORDER BY d DESC LIMIT 100;

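It is even faster: 2 ms, and it uses the index on (license_plate_id, date) rather than the reversed one. Also, and this matters, each subquery in the lateral join only hits index pages that contain rows which will actually be selected, whereas the previous queries hit many more index pages. So you save RAM buffers.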

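If you don't need the index on (date, license_plate_id) for anything else and don't want to keep a useless index, this could be interesting, since this query doesn't use it. On the other hand, if you need the index on (date, license_plate_id) for something else and want to keep it, then... maybe not.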
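
For the table in the question, the lateral approach could be sketched as follows. This is an untested adaptation, assuming the existing report_lp_datetime_index on (license_plate_id, datetime) serves each per-plate subquery via a backward scan; the aliases ids and latest are illustrative.

```sql
SELECT latest.*
FROM (VALUES (1),(2),(4),(5),(6),(7),(8),(10),(15),(22),(34),(75))
     AS ids(license_plate_id)
CROSS JOIN LATERAL (
    -- At most 100 newest rows per plate, read straight from the
    -- (license_plate_id, datetime) index.
    SELECT *
    FROM report
    WHERE report.license_plate_id = ids.license_plate_id
    ORDER BY report.datetime DESC
    LIMIT 100
) AS latest
-- Top-N sort over at most 12 * 100 candidate rows.
ORDER BY latest.datetime DESC
LIMIT 100;
```

Because each inner subquery already returns full rows, this form does not need the extra join back to the table that the foo example uses.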

Please post the results of the winning query.

answered 2021-01-31T14:16:37.817