sql - Optimize groupwise maximum query

Question

select * 
from records 
where id in ( select max(id) from records group by option_id )

This query works fine even with millions of rows. However, as the output of the explain statement shows:

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------------------
Nested Loop  (cost=30218.84..31781.62 rows=620158 width=44) (actual time=1439.251..1443.458 rows=1057 loops=1)
->  HashAggregate  (cost=30218.41..30220.41 rows=200 width=4) (actual time=1439.203..1439.503 rows=1057 loops=1)
     ->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=1438.523..1438.807 rows=1057 loops=1)
           ->  Seq Scan on records records_1  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.103..527.914 rows=1240315 loops=1)
->  Index Scan using records_pkey on records  (cost=0.43..7.80 rows=1 width=44) (actual time=0.002..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records_1.id)))
Total runtime: 1443.752 ms

(cost=0.00..23995.15 rows=1240315 width=8) <- Here it says it is scanning all rows, which is obviously inefficient.

I also tried reordering the query:

select r.* from records r
inner join (select max(id) id from records group by option_id) r2 on r2.id= r.id;

                                               QUERY PLAN
-------------------------------------------------------------------------------------------------------------------------------

Nested Loop  (cost=30197.15..37741.04 rows=964 width=44) (actual time=835.519..840.452 rows=1057 loops=1)
->  HashAggregate  (cost=30196.72..30206.36 rows=964 width=8) (actual time=835.471..835.836 rows=1057 loops=1)
     ->  Seq Scan on records  (cost=0.00..23995.15 rows=1240315 width=8) (actual time=0.336..348.495 rows=1240315 loops=1)
->  Index Scan using records_pkey on records r  (cost=0.43..7.80 rows=1 width=44) (actual time=0.003..0.003 rows=1 loops=1057)
     Index Cond: (id = (max(records.id)))
Total runtime: 840.809 ms

(cost=0.00..23995.15 rows=1240315 width=8) <- Still scanning all rows.

I tried with and without indexes on (option_id), (option_id, id), and (option_id, id desc). None of them had any effect on the query plan.

Is there a way to execute a groupwise maximum query in Postgres without scanning all rows?

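What I am looking for, programmatically, is an index that stores the maximum id for each option_id as rows are inserted into the records table. That way, when I query for the maximum per option_id, I should only need to read as many index entries as there are distinct option_id values.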

I have seen the select distinct on answers from high-ranking users (thanks to @Clodoaldo Neto for giving me the keywords to search for). Here is why it does not work:

create index index_name on records(option_id, id desc)

select distinct on (option_id) *
from records
order by option_id, id desc
                                               QUERY PLAN
------------------------------------------------------------------------------------------------------------------------------------------------------------
Unique  (cost=0.43..76053.10 rows=964 width=44) (actual time=0.049..1668.545 rows=1056 loops=1)
  ->  Index Scan using records_option_id_id_idx on records  (cost=0.43..73337.25 rows=1086342 width=44) (actual time=0.046..1368.300 rows=1086342 loops=1)
Total runtime: 1668.817 ms

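Great, it is using the index. However, using an index to scan all the ids does not make much sense. In my runs, it is actually slower than a simple sequential scan.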

Interestingly, MySQL 5.5 is able to optimize the query simply by using an index on records(option_id, id):

mysql> select count(1) from records;

+----------+
| count(1) |
+----------+
|  1086342 |
+----------+

1 row in set (0.00 sec)

mysql> explain extended select * from records
       inner join ( select max(id) max_id from records group by option_id ) mr
                                                      on mr.max_id= records.id;

+------+----------+--------------------------+
| rows | filtered | Extra                    |
+------+----------+--------------------------+
| 1056 |   100.00 |                          |
|    1 |   100.00 |                          |
|  201 |   100.00 | Using index for group-by |
+------+----------+--------------------------+

3 rows in set, 1 warning (0.02 sec)

score 14 · Accepted Answer

Assuming relatively few rows in options for many rows in records.

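Typically, you would have a look-up table options that is referenced from records.option_id, ideally with a foreign key constraint. If you don't, I suggest creating one to enforce referential integrity: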

CREATE TABLE options (
  option_id int  PRIMARY KEY
, option    text UNIQUE NOT NULL
);

INSERT INTO options
SELECT DISTINCT option_id, 'option' || option_id -- dummy option names
FROM   records;

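Then there is no need to emulate a loose index scan any more, and this becomes very simple and fast. Correlated subqueries can use a plain index on (option_id, id).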

SELECT option_id, (SELECT max(id)
                   FROM   records
                   WHERE  option_id = o.option_id) AS max_id
FROM   options o
ORDER  BY 1;

This includes options that have no match in table records. You get NULL for max_id in those rows, and you can easily remove them in an outer SELECT if needed.

Or (same result):

SELECT option_id, (SELECT id
                   FROM   records
                   WHERE  option_id = o.option_id
                   ORDER  BY id DESC NULLS LAST
                   LIMIT  1) AS max_id
FROM   options o
ORDER  BY 1;

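May be slightly faster. The subquery uses the sort order DESC NULLS LAST, the same as the aggregate function max(), which ignores NULL values. Sorting with just DESC would put NULL first:

Why do NULL values come first when ordering DESC in a PostgreSQL query?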

The perfect index for this:

CREATE INDEX on records (option_id, id DESC NULLS LAST);

The index sort order does not matter much as long as the columns are defined NOT NULL.

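There may still be a sequential scan on the small table options; that is simply the fastest way to fetch all rows. The ORDER BY may bring in an index (only) scan to fetch pre-sorted rows.
The big table records is only accessed via a (bitmap) index scan or, if possible, an index-only scan.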
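
To check which of these scan types the planner actually picks on your data, you can run the query under EXPLAIN. This is only a minimal sketch, assuming the options table and the index on records described above already exist; the plan you get depends on table statistics and on the Postgres version.

```sql
-- Show the actual plan and timings for the correlated-subquery form.
-- BUFFERS adds buffer hit/read counts to each plan node.
EXPLAIN (ANALYZE, BUFFERS)
SELECT option_id, (SELECT id
                   FROM   records
                   WHERE  option_id = o.option_id
                   ORDER  BY id DESC NULLS LAST
                   LIMIT  1) AS max_id
FROM   options o
ORDER  BY 1;
```

An index-only scan on records also depends on the visibility map being reasonably current, so a recent VACUUM on the table may be needed before one shows up in the plan.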

db<>fiddle here - showing two index-only scans for the simple case
Old sqlfiddle

Or use a LATERAL join for a similar effect in Postgres 9.3+:

Optimize GROUP BY query to retrieve latest row per user
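
A LATERAL version might look like the following. It is a sketch under the same assumptions as the queries above (the options look-up table and an index on records (option_id, id DESC NULLS LAST)), and it returns whole rows from records rather than only the id.

```sql
-- Sketch: fetch the complete latest row per option (Postgres 9.3+).
SELECT r.*
FROM   options o
CROSS  JOIN LATERAL (
   SELECT *
   FROM   records
   WHERE  option_id = o.option_id
   ORDER  BY id DESC NULLS LAST
   LIMIT  1
   ) r
ORDER  BY r.option_id;
```

CROSS JOIN LATERAL drops options that have no row in records. To keep them, with NULL values for the columns of r, use LEFT JOIN LATERAL (...) r ON true instead.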

score 2 · Accepted Answer

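PostgreSQL does not support the loose index scan that MySQL is able to use for queries like this. That is the Using index for group-by you see in the MySQL plan.

Basically, it returns the first or last entry in a range that matches a subset of a composite key, then searches for the next or previous value of that subset.

In your case, it first returns the last value of the whole index on (option_id, id) (which by definition happens to hold MAX(id) for the greatest option_id), then searches for the last value with the next-largest option_id, and so on.

PostgreSQL's optimizer is not able to build such a plan. However, PostgreSQL lets you emulate it in SQL. If you have many records but few distinct option_id values, it is worth doing.

To do this, first create the index: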

CREATE INDEX ix_records_option_id ON records (option_id, id);

Then run this query:

WITH RECURSIVE q (option_id) AS
        (
        SELECT  MIN(option_id)
        FROM    records
        UNION ALL
        SELECT  (
                SELECT  MIN(option_id)
                FROM    records
                WHERE   option_id > q.option_id
                )
        FROM    q
        WHERE   option_id IS NOT NULL
        )
SELECT  option_id,
        (
        SELECT  MAX(id)
        FROM    records r
        WHERE   r.option_id = q.option_id
        )
FROM    q
WHERE   option_id IS NOT NULL

See it on sqlfiddle.com: http://sqlfiddle.com/#!15/4d77d/4
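
The query above returns only option_id and the maximum id, while the question selects whole rows. One way to get them, sketched here on the assumption that id is the primary key of records (as records_pkey in the question's plans suggests), is to join the recursive result back to the table. The alias ri is introduced purely for illustration.

```sql
WITH RECURSIVE q (option_id) AS
        (
        SELECT  MIN(option_id)
        FROM    records
        UNION ALL
        SELECT  (
                SELECT  MIN(option_id)
                FROM    records
                WHERE   option_id > q.option_id
                )
        FROM    q
        WHERE   option_id IS NOT NULL
        )
SELECT  r.*
FROM    q
JOIN    records r
ON      r.id =
        (
        -- greatest id within the current option_id
        SELECT  MAX(id)
        FROM    records ri
        WHERE   ri.option_id = q.option_id
        )
WHERE   q.option_id IS NOT NULL;
```

Whether the join is resolved with a lookup on the primary key for each option_id is up to the planner, so it is worth confirming with EXPLAIN ANALYZE on real data.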

score 2 · Accepted Answer

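You mention wanting an index that indexes only the max(id) for each option_id. PostgreSQL does not currently support this. If such a feature is added in the future, it would probably be done through the mechanism of making a materialized view on the aggregate query and then indexing the materialized view. I would not expect that for at least a couple of years, though.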
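
For illustration only, this is roughly what a manually refreshed materialized view on the aggregate query looks like in Postgres 9.3+. It is not the automatically maintained structure described above: the view is stale until it is refreshed, and each refresh recomputes the aggregate over the whole table. The name records_max_id is hypothetical.

```sql
-- Sketch only: a manually refreshed summary, not an index kept up to date on insert.
CREATE MATERIALIZED VIEW records_max_id AS   -- hypothetical name
SELECT option_id, max(id) AS max_id
FROM   records
GROUP  BY option_id;

-- Index the summary so lookups by option_id are cheap.
CREATE UNIQUE INDEX ON records_max_id (option_id);

-- Has to be re-run after changes to records.
REFRESH MATERIALIZED VIEW records_max_id;
```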

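What you can do now, however, is use a recursive query that skips through the index to each unique value of option_id. See the PostgreSQL wiki page for a general description of the technique.

The way to apply this to your case is to write the recursive query so that it returns the distinct values of option_id, and then, for each of those, subselect the max(id):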

with recursive dist as (
  select min(option_id) as option_id from records
union all
  select (select min(option_id) from records where option_id > dist.option_id) 
     from dist where dist.option_id is not null
) 

select option_id, 
  (select max(id) from records where records.option_id=dist.option_id)
from dist where option_id is not null;

It is ugly, but you can hide it behind a view.

In my hands, the recursive query runs in 43 ms, rather than 513 ms for the distinct on variety.
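
Hiding the recursive query behind a view, as suggested above, could look like this. It is a sketch: the view name and the max_id column alias are made up for the example, and the query body is otherwise the one shown earlier.

```sql
-- Hypothetical view name; wraps the recursive query from above.
create view records_max_id_per_option as
with recursive dist as (
  select min(option_id) as option_id from records
union all
  select (select min(option_id) from records where option_id > dist.option_id)
     from dist where dist.option_id is not null
)
select option_id,
  (select max(id) from records where records.option_id = dist.option_id) as max_id
from dist where option_id is not null;

-- Callers then see an ordinary relation.
select * from records_max_id_per_option;
```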

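If you could find a way to incorporate the max(id) into the recursive query, it might be about twice as fast, but I could not find a way to do that. The problem is that these queries have a rather rigid syntax: you cannot use "limit" or "order by" in conjunction with the UNION ALL.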

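This query touches pages that are widely scattered throughout the index, and if those pages do not fit in the cache, you will do a lot of inefficient IO. However, if this type of query is popular, the 1057 leaf index pages will have little trouble staying in the cache.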
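
On the idea of folding max(id) into the recursive query: one possible approach, offered as an untimed sketch and not as part of the measurements above, relies on the fact that ORDER BY and LIMIT are accepted when the non-recursive branch is wrapped in parentheses, and inside a LATERAL subquery in the recursive branch (Postgres 9.3+). It assumes option_id and id are both NOT NULL and that the index on (option_id, id) exists. The CTE name cte is arbitrary.

```sql
WITH RECURSIVE cte AS (
   (  -- parentheses make ORDER BY / LIMIT apply to this branch only
   SELECT option_id, id
   FROM   records
   ORDER  BY option_id DESC, id DESC
   LIMIT  1
   )
   UNION ALL
   SELECT r.option_id, r.id
   FROM   cte c
   CROSS  JOIN LATERAL (
      -- greatest id of the next smaller option_id
      SELECT option_id, id
      FROM   records
      WHERE  option_id < c.option_id
      ORDER  BY option_id DESC, id DESC
      LIMIT  1
      ) r
   )
SELECT option_id, id AS max_id
FROM   cte
ORDER  BY option_id;
```

The recursion walks the option_id values from largest to smallest and stops once the LATERAL subquery finds no smaller value. Whether it actually beats the two-step version would have to be measured on your data.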

This is how I set up my test case:

create table records as
select floor(random()*1057)::integer as option_id,
       floor(random()*50000000)::integer as id
from generate_series(1,1240315);
create index on records (option_id, id);
explain analyze;

score 1 · Accepted Answer

select distinct on (option_id) *
from records
order by option_id, id desc

The index will only be used if the cardinality is favorable. That said, you can try a composite index:

create index index_name on records(option_id, id desc)
