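I have two tables, one for profiles and one for a profile's employment status. The two tables have a one-to-one relationship, and a profile might not have an employment status. The table schemas are as follows (irrelevant columns removed for clarity):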

create type employment_status as enum ('claimed', 'approved', 'denied');

create table if not exists profiles
(
    id bigserial not null
        constraint profiles_pkey
            primary key
);

create table if not exists employments
(
    id bigserial not null
        constraint employments_pkey
            primary key,
    status employment_status not null,
    profile_id bigint not null
        constraint fk_rails_d95865cd58
            references profiles
                on delete cascade
);

create unique index if not exists index_employments_on_profile_id
    on employments (profile_id);

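Using these tables, I was asked to list all unemployed profiles. An unemployed profile is defined as a profile that either has no employment record or has an employment whose status is anything other than 'approved'.

My first attempt was the following query: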

SELECT * FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id 
WHERE employments.status != 'approved'

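The assumption here was that every profile would be listed together with its employment, and that I could then filter the rows with the where condition. Any profile without an employment record would have a null employment status, which I expected to satisfy the != 'approved' condition. However, this query does not return the profiles that have no employment.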
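
The behaviour described above comes from SQL's three-valued logic: comparing NULL with != yields NULL rather than true, and WHERE keeps only rows whose condition is true. A minimal sketch of that, assuming the employment_status type and the tables from the schema above already exist; the IS DISTINCT FROM variant is one possible rewrite, not the query from the post mentioned below:

```sql
-- A NULL status compared with <> (same as !=) gives NULL, so WHERE drops the row.
SELECT NULL::employment_status <> 'approved' AS comparison;                -- NULL

-- IS DISTINCT FROM treats NULL as an ordinary, comparable value.
SELECT NULL::employment_status IS DISTINCT FROM 'approved' AS comparison;  -- true

-- The first query with a NULL-safe comparison: profiles without an
-- employment row now pass the filter as well.
SELECT profiles.*
FROM profiles
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE employments.status IS DISTINCT FROM 'approved';
```

Because status is declared not null, a NULL status after the left join can only mean that the profile has no employment row.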

After some research I found this post, which explains why that does not work, and I changed my query to:

SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';

This actually does work. However, my ORM generates a slightly different query, which does not work:

SELECT profiles.* FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'

The only difference is the select clause. To understand why such a small difference has such a large effect, I ran EXPLAIN ANALYZE on all three queries:

EXPLAIN ANALYZE SELECT * FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id 
WHERE employments.status != 'approved'

Hash Join  (cost=14.28..37.13 rows=846 width=452) (actual time=0.025..0.027 rows=2 loops=1)
  Hash Cond: (e.profile_id = profiles.id)
  ->  Seq Scan on employments e  (cost=0.00..20.62 rows=846 width=68) (actual time=0.008..0.009 rows=2 loops=1)
        Filter: (status <> ''approved''::employment_status)
        Rows Removed by Filter: 1
  ->  Hash  (cost=11.90..11.90 rows=190 width=384) (actual time=0.007..0.007 rows=8 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.003..0.004 rows=8 loops=1)
Planning Time: 0.111 ms
Execution Time: 0.053 ms

EXPLAIN ANALYZE SELECT *
FROM profiles
LEFT JOIN employments ON profiles.id = employments.profile_id and employments.status != 'approved';

Hash Right Join  (cost=14.28..37.13 rows=846 width=452) (actual time=0.036..0.042 rows=8 loops=1)
  Hash Cond: (employments.profile_id = profiles.id)
  ->  Seq Scan on employments  (cost=0.00..20.62 rows=846 width=68) (actual time=0.005..0.005 rows=2 loops=1)
        Filter: (status <> ''approved''::employment_status)
        Rows Removed by Filter: 1
  ->  Hash  (cost=11.90..11.90 rows=190 width=384) (actual time=0.015..0.015 rows=8 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.010..0.011 rows=8 loops=1)
Planning Time: 0.106 ms
Execution Time: 0.108 ms

EXPLAIN ANALYZE SELECT profiles.* FROM "profiles" 
LEFT JOIN employments ON employments.profile_id = profiles.id AND employments.status != 'approved'

Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.006..0.007 rows=8 loops=1)
Planning Time: 0.063 ms
Execution Time: 0.016 ms

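The first and second query plans are almost identical, except that one uses a Hash Join and the other a Hash Right Join, while the last query performs neither the join nor the condition.

I came up with a fourth query that does work: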

EXPLAIN ANALYZE SELECT profiles.* FROM profiles 
LEFT JOIN employments ON employments.profile_id = profiles.id 
WHERE (employments.id IS NULL OR employments.status != 'approved')

Hash Right Join  (cost=14.28..35.02 rows=846 width=384) (actual time=0.021..0.026 rows=7 loops=1)
  Hash Cond: (employments.profile_id = profiles.id)
  Filter: ((employments.id IS NULL) OR (employments.status <> ''approved''::employment_status))
  Rows Removed by Filter: 1
  ->  Seq Scan on employments  (cost=0.00..18.50 rows=850 width=20) (actual time=0.002..0.003 rows=3 loops=1)
  ->  Hash  (cost=11.90..11.90 rows=190 width=384) (actual time=0.011..0.011 rows=8 loops=1)
        Buckets: 1024  Batches: 1  Memory Usage: 12kB
        ->  Seq Scan on profiles  (cost=0.00..11.90 rows=190 width=384) (actual time=0.007..0.008 rows=8 loops=1)
Planning Time: 0.104 ms
Execution Time: 0.049 ms

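My questions on this topic are:

  1. Why are the query plans of the second and third queries different even though they have the same structure?
  2. Why are the query plans of the first and fourth queries different even though they have the same structure?
  3. Why does Postgres completely ignore my join and its condition in the third query?

Edit:

With the following sample data, the query should return profiles 2 and 3.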

insert into profiles values (1);
insert into profiles values (2);
insert into profiles values (3);

insert into employments (profile_id, status) values (1, 'approved');
insert into employments (profile_id, status) values (2, 'denied');
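
As a check against this sample data, the sketch below runs the fourth query and the ON-clause variant side by side. It assumes the schema and the five inserts above on otherwise empty tables. The NOT EXISTS form at the end is an alternative formulation added for illustration, not one of the queries from the question.

```sql
-- Fourth query: keeps profiles with no employment row or a non-approved one.
SELECT profiles.id
FROM profiles
LEFT JOIN employments ON employments.profile_id = profiles.id
WHERE employments.id IS NULL OR employments.status != 'approved'
ORDER BY profiles.id;
-- ids: 2, 3

-- Condition moved into ON: a left join never drops rows from profiles, so
-- profile 1 is still returned, only with its employment columns set to NULL.
SELECT profiles.id, employments.status
FROM profiles
LEFT JOIN employments
       ON employments.profile_id = profiles.id
      AND employments.status != 'approved'
ORDER BY profiles.id;
-- rows: (1, NULL), (2, denied), (3, NULL)

-- Alternative without a join: profiles that have no approved employment.
SELECT profiles.id
FROM profiles
WHERE NOT EXISTS (
    SELECT 1
    FROM employments
    WHERE employments.profile_id = profiles.id
      AND employments.status = 'approved'
)
ORDER BY profiles.id;
-- ids: 2, 3
```

On this data, the ON-clause variant returns every profile, so it only appears to work when the caller also inspects the employment columns.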

1 Answer

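There must be a unique or primary key constraint on employments.profile_id (or it is a view with an appropriate DISTINCT clause), so that the optimizer knows that at most one row in employments can be related to a given row in profiles. In your schema, the unique index index_employments_on_profile_id provides that guarantee.

If that is the case and you do not use any of employments' columns in the SELECT list, the optimizer deduces that the join is redundant and need not be computed, which makes for a simpler and faster execution plan.

See the comment for join_is_removable in src/backend/optimizer/plan/analyzejoins.c: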
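
One way to observe this is to compare the plans of two queries that differ only in whether an employments column is needed above the join. This is a sketch assuming the schema from the question; the exact plan shapes and costs depend on the PostgreSQL version and table statistics, so no output is shown here.

```sql
-- No employments column is used outside the join condition, and
-- employments.profile_id is unique, so the planner may remove the join.
-- The question shows a plain "Seq Scan on profiles" for this query.
EXPLAIN
SELECT profiles.*
FROM profiles
LEFT JOIN employments
       ON employments.profile_id = profiles.id
      AND employments.status != 'approved';

-- Here employments.status is part of the result, so the inner side
-- produces a variable needed above the join and the join has to be executed.
EXPLAIN
SELECT profiles.*, employments.status
FROM profiles
LEFT JOIN employments
       ON employments.profile_id = profiles.id
      AND employments.status != 'approved';
```

The same reasoning applies to a WHERE clause that references employments columns, as in the fourth query of the question: the columns are needed above the join, so the join is kept.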

/*
 * join_is_removable
 *    Check whether we need not perform this special join at all, because
 *    it will just duplicate its left input.
 *
 * This is true for a left join for which the join condition cannot match
 * more than one inner-side row.  (There are other possibly interesting
 * cases, but we don't have the infrastructure to prove them.)  We also
 * have to check that the inner side doesn't generate any variables needed
 * above the join.
 */
Answered 2020-09-30T06:43:16.833