postgresql - Why does distinct on in a subquery hurt performance in PostgreSQL?

Question

I have a table users with the fields id and email. id is the primary key, and email is also indexed.

database> \d users
+-----------------------------+-----------------------------+-----------------------------------------------------+
| Column                      | Type                        | Modifiers                                           |
|-----------------------------+-----------------------------+-----------------------------------------------------|
| id                          | integer                     |  not null default nextval('users_id_seq'::regclass) |
| email                       | character varying           |                                                     |
+-----------------------------+-----------------------------+-----------------------------------------------------+
Indexes:
    "users_pkey" PRIMARY KEY, btree (id)
    "index_users_on_email" UNIQUE, btree (email)

If I query the table with a distinct on (email) clause in a subquery, I take a significant performance hit.

database> explain (analyze, buffers)
   select
     id
   from (
     select distinct on (email)
       id
     from
       users
   ) as t
   where id = 123
+-----------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                  |
|-----------------------------------------------------------------------------------------------------------------------------|
| Subquery Scan on t  (cost=8898.69..10077.84 rows=337 width=4) (actual time=221.133..250.782 rows=1 loops=1)                 |
|   Filter: (t.id = 123)                                                                                                      |
|   Rows Removed by Filter: 67379                                                                                             |
|   Buffers: shared hit=2824, temp read=288 written=289                                                                       |
|   ->  Unique  (cost=8898.69..9235.59 rows=67380 width=24) (actual time=221.121..247.582 rows=67380 loops=1)                 |
|         Buffers: shared hit=2824, temp read=288 written=289                                                                 |
|         ->  Sort  (cost=8898.69..9067.14 rows=67380 width=24) (actual time=221.120..239.573 rows=67380 loops=1)             |
|               Sort Key: users.email                                                                                         |
|               Sort Method: external merge  Disk: 2304kB                                                                     |
|               Buffers: shared hit=2824, temp read=288 written=289                                                           |
|               ->  Seq Scan on users  (cost=0.00..3494.80 rows=67380 width=24) (actual time=0.009..9.714 rows=67380 loops=1) |
|                     Buffers: shared hit=2821                                                                                |
| Planning Time: 0.243 ms                                                                                                     |
| Execution Time: 251.258 ms                                                                                                  |
+-----------------------------------------------------------------------------------------------------------------------------+

Compare that with distinct on (id), whose estimated cost is less than one-thousandth of the previous query's.

database> explain (analyze, buffers)
   select
     id
   from (
     select distinct on (id)
       id
     from
       users
   ) as t
   where id = 123
+-----------------------------------------------------------------------------------------------------------------------------+
| QUERY PLAN                                                                                                                  |
|-----------------------------------------------------------------------------------------------------------------------------|
| Unique  (cost=0.29..8.31 rows=1 width=4) (actual time=0.021..0.022 rows=1 loops=1)                                          |
|   Buffers: shared hit=3                                                                                                     |
|   ->  Index Only Scan using users_pkey on users  (cost=0.29..8.31 rows=1 width=4) (actual time=0.020..0.020 rows=1 loops=1) |
|         Index Cond: (id = 123)                                                                                              |
|         Heap Fetches: 1                                                                                                     |
|         Buffers: shared hit=3                                                                                               |
| Planning Time: 0.090 ms                                                                                                     |
| Execution Time: 0.034 ms                                                                                                    |
+-----------------------------------------------------------------------------------------------------------------------------+

Why is this?

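The real problem I have is that I am trying to create a view that does a distinct on on an indexed column that is not unique, and the performance is very bad.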

score 4 · Accepted Answer

Logical difference

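Both columns, id and email, are UNIQUE. But only id is NOT NULL. (PRIMARY KEY columns always are.) NULL values are not considered equal to each other, so multiple NULL values are allowed in a column with a UNIQUE constraint (or index). That is according to standard SQL. See: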

Allow null in unique column

But DISTINCT and DISTINCT ON consider NULL values equal. The manual:

Obviously, two rows are considered distinct if they differ in at least one column value. Null values are considered equal in this comparison.

Bold emphasis mine. Further reading:

Select first row in each GROUP BY group?

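In your second query, distinct on (id) is a logical no-op: the result is guaranteed to be the same as without DISTINCT ON. And since the outer SELECT filters on id = 123, Postgres can strip the noise and do a very cheap index-only scan.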

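In your first query, on the other hand, distinct on (email) is not a no-op if there is more than one row with email IS NULL. Postgres then has to pick the first id according to the given sort order. Since there is no ORDER BY, that is an arbitrary pick. But the outer SELECT with the predicate where id = 123 may depend on the result. The whole query differs from the second in principle - and is broken by design.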
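The NULL handling described above can be reproduced with a minimal sketch. The table users_demo and its rows are hypothetical, made up for illustration and not taken from the question. The sketch assumes the default UNIQUE behavior, where NULLs are treated as distinct:

```sql
-- Hypothetical scratch table; names and data are illustrative only.
CREATE TEMP TABLE users_demo (
   id    integer PRIMARY KEY,
   email varchar UNIQUE
);

-- UNIQUE accepts several NULLs, because NULLs are not considered equal there.
INSERT INTO users_demo (id, email) VALUES
   (1, 'a@foo.com'),
   (2, NULL),
   (3, NULL);

-- DISTINCT ON treats both NULL emails as equal and keeps only one row for them.
-- Without ORDER BY, which id survives (2 or 3) is arbitrary.
SELECT DISTINCT ON (email) id, email
FROM   users_demo;

-- So the outer filter may find a row or not, depending on that arbitrary pick.
SELECT id
FROM  (SELECT DISTINCT ON (email) id FROM users_demo) t
WHERE  id = 3;

-- Adding ORDER BY makes the pick deterministic: the smallest id per email.
SELECT DISTINCT ON (email) id, email
FROM   users_demo
ORDER  BY email, id;
```

With the ORDER BY variant, the row kept for the NULL group is always the one with id = 2, so a filter on id = 3 consistently returns nothing.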

Serendipity

Apart from that, there are two "lucky" findings:

Sort Method: external merge  Disk: 2304kB

The mention of "Disk" indicates insufficient work_mem. See:

Configuration parameter work_mem in PostgreSQL on Linux
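One way to check whether work_mem is the limiting factor is to raise it for the current session only and rerun the plan. This is a sketch under assumptions: the value 16MB is an arbitrary test value, not a recommendation, and the right setting depends on available RAM and concurrency.

```sql
-- Session-level change only; 16MB is an assumed test value.
SET work_mem = '16MB';

-- Rerun the slow query and look at the "Sort Method" line again.
-- If the sort now fits in memory, it should no longer report "Disk".
EXPLAIN (ANALYZE, BUFFERS)
SELECT id
FROM  (SELECT DISTINCT ON (email) id FROM users) t
WHERE  id = 123;

-- Return to the configured default for this session.
RESET work_mem;
```

An in-memory sort typically needs noticeably more memory than the on-disk figure reported in the plan, so a value only slightly above 2304kB may still spill to disk.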

          ->  Seq Scan on users  (cost=0.00..3494.80 rows=67380

In my tests, I always got an index scan here. That indicates a bloated index or some other problem with your setup.

Useful comparison?

The comparison leads nowhere. We might learn something from comparing the first query to this one - after switching the roles of the PK and UNIQUE columns:

select email
from  (select distinct on (id) email from users) t
where email = 'user123@foo.com';

Or from comparing the second query to this one - trying with the UNIQUE column instead of the PK column:

select email
from  (select distinct on (email) email from users) t
where email = 'user123@foo.com';

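We learn that PK and UNIQUE constraints do not affect the query plan differently. Postgres does not use this meta-information to cut corners. The PK does make a difference with GROUP BY, though. See: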

Unexpected behavior of GROUP BY query

So this works:

SELECT email
FROM  (
   SELECT email -- no aggregate required, because id = PK
   FROM   users
   GROUP  BY id  -- !
   ) t
WHERE email = 'user123@foo.com';

But the same does not work after switching id and email. I added some demos in the fiddle:

db<>fiddle here
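For reference, the switched variant might look like the sketch below. It is written against the users table from the question and is expected to be rejected, because Postgres recognizes the functional dependency only for a primary key, not for a UNIQUE column:

```sql
-- Expected to fail with an error saying that users.id must appear in the
-- GROUP BY clause or be used in an aggregate function.
SELECT id
FROM  (
   SELECT id         -- not allowed: id is neither grouped nor aggregated
   FROM   users
   GROUP  BY email   -- UNIQUE, but not the PK
   ) t
WHERE  id = 123;
```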

So?

Both queries are nonsensical, for different reasons. I don't see how they can help with your real query:

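The real problem I have is that I am trying to create a view that does a distinct on on an indexed column that is not unique, and the performance is very bad.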

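We would need to see your real query - and all other relevant details of your setup. There may be a solution, but it may well go beyond the scope of a question on SO. Consider hiring a consultant. Or consider one of these approaches to optimize performance: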

Optimize GROUP BY query to retrieve latest row per user
