sql - Are indexes preserved on derived tables?

Question

Suppose I have a large table containing genomic positions from various files, like this:

CREATE TABLE chromosomal_positions (
    file_id  INT,
    chromosome_id INT, 
    position INT
)

I want to compare the contents of one file against the contents of all the other files to find overlaps, so I use derived tables.

SELECT Count(*) 
FROM   (SELECT * 
        FROM   chromosomal_positions 
        WHERE  file_id = 1) AS file_1 
      JOIN (SELECT * 
            FROM   chromosomal_positions 
            WHERE  file_id != 1) AS other_files 
         ON ( file_1.chromosome_id = other_files.chromosome_id 
              AND file_1.position = other_files.position )

Now, if I add a composite index on (file_id, chromosome_id, position), in that order, will the derived tables be able to use that index? (Using Postgres.)

score 2 · Accepted Answer

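It's not so much that PostgreSQL "preserves" indexes across subqueries as that the rewriter can often simplify and restructure your query so that it operates on the base table directly.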

In this case the query is unnecessarily complicated; the subqueries can be eliminated entirely, which turns this into a trivial self-join.

SELECT count(*) 
FROM  chromosomal_positions file_1 
INNER JOIN chromosomal_positions other_files
ON ( file_1.chromosome_id = other_files.chromosome_id 
     AND file_1.position = other_files.position ) 
WHERE file_1.file_id = 1
AND   other_files.file_id != 1;

So an index on (chromosome_id, position) would clearly be useful here.

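You can always experiment with index choices and usage, using explain analyze to determine what the query planner is actually doing. For example, to see whether your proposed index would be used, I would run: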

CREATE INDEX cp_f_c_p ON chromosomal_positions (file_id, chromosome_id, position);

-- Planner would prefer seqscan because there's not really any data;
-- force it to prefer other plans.
SET enable_seqscan = off;

EXPLAIN SELECT count(*) 
FROM (
  SELECT * 
  FROM   chromosomal_positions 
  WHERE  file_id = 1
) AS file_1 
INNER JOIN (
  SELECT * 
  FROM   chromosomal_positions 
  WHERE  file_id != 1
) AS other_files 
ON ( file_1.chromosome_id = other_files.chromosome_id 
     AND file_1.position = other_files.position )

and get the result:

                                                                                   QUERY PLAN                                                                                   
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Aggregate  (cost=78.01..78.02 rows=1 width=0)
   ->  Hash Join  (cost=29.27..78.01 rows=1 width=0)
         Hash Cond: ((chromosomal_positions_1.chromosome_id = chromosomal_positions.chromosome_id) AND (chromosomal_positions_1."position" = chromosomal_positions."position"))
         ->  Bitmap Heap Scan on chromosomal_positions chromosomal_positions_1  (cost=14.34..48.59 rows=1930 width=8)
               Filter: (file_id <> 1)
               ->  Bitmap Index Scan on cp_f_c_p  (cost=0.00..13.85 rows=1940 width=0)
         ->  Hash  (cost=14.79..14.79 rows=10 width=8)
               ->  Bitmap Heap Scan on chromosomal_positions  (cost=4.23..14.79 rows=10 width=8)
                     Recheck Cond: (file_id = 1)
                     ->  Bitmap Index Scan on cp_f_c_p  (cost=0.00..4.23 rows=10 width=0)
                           Index Cond: (file_id = 1)
(11 rows)

(View on explain.depesz.com)

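This shows that while the planner will use the index, it only uses it for the first column. It does not use the remaining columns; it is just filtering on file_id. So the following index is just as good, and is smaller and cheaper to maintain: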

CREATE INDEX cp_file_id ON chromosomal_positions(file_id);

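Sure enough, if you create this index, Pg will prefer it. So no, your proposed index does not appear to be useful, unless the planner merely considers it not worth using at this data scale and would pick it in a completely different plan with more data. You really have to test on real data to be sure.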
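One way to check the "smaller and cheaper" claim on your own data is to compare the on-disk size and scan counts of the competing indexes. This is only a rough sketch: it assumes both indexes above have been created on chromosomal_positions and that the statistics collector is enabled (the default). pg_stat_user_indexes and pg_relation_size are standard PostgreSQL facilities.

```sql
-- List every index on the table with its size and how often it has been scanned.
-- idx_scan only counts scans since statistics were last reset.
SELECT indexrelname                                 AS index_name,
       pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
       idx_scan                                     AS scans
FROM   pg_stat_user_indexes
WHERE  relname = 'chromosomal_positions'
ORDER  BY pg_relation_size(indexrelid) DESC;
```

An index that stays at zero scans after a representative workload is a candidate for dropping.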

By contrast, the index I proposed:

CREATE INDEX cp_ci_p ON chromosomal_positions (chromosome_id, position);

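is used to look up the chromosome positions for id = 1, at least on an empty dummy data set. I suspect the planner would avoid a nested loop on a data set much bigger than this, though. Again, you really just have to try it and see.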
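If you don't have the real data to hand, a synthetic data set at least gets the planner past the "empty table" case. The following is a sketch under made-up assumptions: the row count, the number of files and chromosomes, and the position range are arbitrary and should be replaced with values resembling your real distribution.

```sql
-- Hypothetical test data: ~50 files, ~25 chromosomes, random positions.
INSERT INTO chromosomal_positions (file_id, chromosome_id, position)
SELECT (random() * 49)::int + 1,
       (random() * 24)::int + 1,
       (random() * 1000000)::int
FROM   generate_series(1, 1000000);

-- Refresh planner statistics so cost estimates reflect the new rows.
ANALYZE chromosomal_positions;

-- Undo the earlier override so the planner chooses freely.
RESET enable_seqscan;

-- Run the self-join for real and show the plan actually used.
EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*)
FROM   chromosomal_positions file_1
INNER JOIN chromosomal_positions other_files
        ON file_1.chromosome_id = other_files.chromosome_id
       AND file_1.position = other_files.position
WHERE  file_1.file_id = 1
AND    other_files.file_id != 1;
```

Repeating the EXPLAIN with each candidate index in place (and the others dropped) shows which ones the planner actually picks at this scale.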

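(By the way, if the planner is forced to materialize a subquery, then it does not "preserve indexes on the derived table", i.e. on the materialized subquery. This is particularly relevant for WITH (CTE) query terms, which PostgreSQL versions before 12 always materialize.)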
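To illustrate the CTE point: since PostgreSQL 12, a non-recursive, side-effect-free CTE that is referenced only once is normally inlined into the outer query, and you can control this explicitly with MATERIALIZED / NOT MATERIALIZED. A sketch, assuming PostgreSQL 12 or later (older versions reject these keywords):

```sql
-- Ask for the CTE to be inlined, so the planner can work on the base table
-- (and its indexes) as if the subquery had been written in place.
EXPLAIN
WITH file_1 AS NOT MATERIALIZED (
    SELECT *
    FROM   chromosomal_positions
    WHERE  file_id = 1
)
SELECT count(*)
FROM   file_1
JOIN   chromosomal_positions other_files
       ON file_1.chromosome_id = other_files.chromosome_id
      AND file_1.position = other_files.position
WHERE  other_files.file_id != 1;
```

Replacing NOT MATERIALIZED with MATERIALIZED forces the old behaviour; in that plan the CTE appears as a separate CTE Scan node, and the base-table indexes cannot be used to look up rows in the CTE's result.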
