postgresql - Postgres trigram search is slow

Question

I have a table with about 3 million rows. I created a single GIN index covering several columns of the table.

CREATE INDEX search_idx ON customer USING gin (name gin_trgm_ops, id gin_trgm_ops, data gin_trgm_ops)

I am running the following query (simplified to use a single column in the condition), but it takes about 4 seconds:

EXPLAIN ANALYSE
SELECT c.id, similarity(c.name, 'john') sml
FROM customer c WHERE
c.name % 'john'
ORDER BY sml DESC
LIMIT 10

The resulting query plan is:

Limit (cost=9255.12..9255.14 rows=10 width=30) (actual time=3771.661..3771.665 rows=10 loops=1)
  -> Sort (cost=9255.12..9260.43 rows=2126 width=30) (actual time=3771.659..3771.661 rows=10 loops=1)
       Sort Key: (similarity((name)::text, 'john'::text)) DESC
       Sort Method: top-N heapsort Memory: 26kB
       -> Bitmap Heap Scan on customer c (cost=1140.48..9209.18 rows=2126 width=30) (actual time=140.665..3770.478 rows=3605 loops=1)
            Recheck Cond: ((name)::text % 'john'::text)
            Rows Removed by Index Recheck: 939598
            Heap Blocks: exact=158055 lossy=132577
            -> Bitmap Index Scan on search_idx (cost=0.00..1139.95 rows=2126 width=0) (actual time=105.609..105.610 rows=458131 loops=1)
                 Index Cond: ((name)::text % 'john'::text)
Planning Time: 0.102 ms

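I don't understand why the first step, the retrieval from search_idx, does not sort the rows and limit them to 10, so that only those 10 rows (rather than 2126) are then fetched from the customer table.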

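Any ideas on how to make this query faster? I tried a GiST index but saw no performance gain. I also tried raising work_mem from 4MB to 32MB, which improved the runtime by about 1 second, but no more than that. I also noticed that even if I remove c.id from the SELECT clause, Postgres does not perform an index-only scan and still visits the main table.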
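For context on the work_mem experiment: the "lossy=132577" figure under Heap Blocks in the plan above indicates that the bitmap did not fit in work_mem and fell back to tracking whole pages, which forces a recheck of every row on those pages. A minimal sketch of repeating the experiment for a single session is shown below; the 32MB value is just the one mentioned above, not a recommendation, and the BUFFERS option is added here only to show how much of the time goes to reading heap pages.

```sql
-- Session-level change only; 32MB is the value tried in the question.
SET work_mem = '32MB';

-- Same query as above, with buffer statistics added to the plan output.
EXPLAIN (ANALYZE, BUFFERS)
SELECT c.id, similarity(c.name, 'john') AS sml
FROM customer c
WHERE c.name % 'john'
ORDER BY sml DESC
LIMIT 10;

-- Return to the configured default afterwards.
RESET work_mem;
```

If the lossy count in Heap Blocks drops to zero, the bitmap fits in memory, although the large number of rows removed by the recheck may still dominate the runtime.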

Thanks for your help.

Update 1: After following Laurenz Albe's suggestion below, query performance improved and is now around 600 ms. The plan now looks like this:

Subquery Scan on q  (cost=0.41..78.29 rows=1 width=12) (actual time=63.150..574.536 rows=10 loops=1)
  Filter: ((q.name)::text % 'john'::text)
  ->  Limit  (cost=0.41..78.16 rows=10 width=40) (actual time=63.148..574.518 rows=10 loops=1)
        ->  Index Scan using search_name_idx on customer c  (cost=0.41..2232864.76 rows=287182 width=40) (actual time=63.146..574.513 rows=10 loops=1)
              Order By: ((name)::text <-> 'john'::text)
Planning Time: 42.671 ms
Execution Time: 585.554 ms

score 1 · Accepted Answer

To get the 10 closest matches with index support, you should create a GiST index and use a query like this:

SELECT id, sml
FROM (SELECT c.id,
             c.name,
             similarity(c.name, 'john') sml
      FROM customer c
      ORDER BY c.name <-> 'john'
      LIMIT 10) AS q
WHERE name % 'john';

The subquery can use the GiST index, and the outer query eliminates all results that do not exceed pg_trgm.similarity_threshold.
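The answer does not spell out the index definition, so the following is only a sketch of what it might look like. The index name search_name_idx is taken from the updated plan in the question; indexing only the name column is an assumption. The threshold value 0.4 is an arbitrary illustration.

```sql
-- pg_trgm is presumably installed already, since the question's GIN index uses it.
CREATE EXTENSION IF NOT EXISTS pg_trgm;

-- GiST trigram index on the searched column; it supports ORDER BY name <-> '...'.
CREATE INDEX search_name_idx ON customer USING gist (name gist_trgm_ops);

-- Threshold applied by the % operator in the outer query (pg_trgm's default is 0.3).
SHOW pg_trgm.similarity_threshold;

-- Optionally change it for the current session; 0.4 is only an example value.
SET pg_trgm.similarity_threshold = 0.4;
```

The distance operator <-> returns one minus the similarity() value, so ordering by it in ascending order yields the most similar names first, which is what lets the index scan stop after 10 rows.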

score 0 · Accepted Answer

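I changed the query to use word_similarity, and the results are good, probably because the similarity threshold for word similarity is higher. A GIN index is used here on all the columns.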

EXPLAIN ANALYSE
SELECT id, name, city,
       greatest(word_similarity('john', id),
                word_similarity('john', name),
                word_similarity('john', city)) sml
FROM platform_vendor
WHERE
   -- filter on the same search term that is scored above
   ('john' <% id
       OR 'john' <% name
       OR 'john' <% city)
ORDER BY sml DESC LIMIT 10

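The explain plan still shows a Bitmap Index Scan on the GIN index, followed by a Bitmap Heap Scan on the table to recheck the index condition, which could be avoided if we used horizontal partitioning.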
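To make the threshold remark concrete, here is a small sketch of how the <% operator relates to word_similarity and its setting. The sample strings and the value 0.5 are illustrative assumptions, not taken from the answer.

```sql
-- <% is true when word_similarity(left, right) reaches the configured threshold.
SELECT word_similarity('john', 'john smith') AS wsml,
       'john' <% 'john smith'               AS passes_threshold;

-- Threshold used by <% (pg_trgm's default is 0.6, versus 0.3 for the % operator).
SHOW pg_trgm.word_similarity_threshold;

-- Lowering it admits looser matches; 0.5 is only an example value.
SET pg_trgm.word_similarity_threshold = 0.5;
```

A higher threshold means fewer candidate rows survive the index condition, which may be part of why this variant performed better.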
