sql - Problem optimizing a Postgres search query

Question

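I have a problem with the following PostgreSQL query: it takes more than 10 seconds to run. Is there any way to speed it up to a reasonable time? I am simply looking for the most relevant search terms associated with videos in a very large database.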

  SELECT count(*), videoid 
  FROM term_search 
  where word = 'tester' 
     OR word = 'question' 
     OR word = 'one' 
  group by videoid 
  order by count(*) desc 
  limit 1800;

When the query is run with EXPLAIN ANALYZE, the resulting query plan is as follows (http://explain.depesz.com/s/yDJ):

  Limit  (cost=389625.50..389630.00 rows=1800 width=4) (actual time=11766.693..11770.001 rows=1800 loops=1)
     Output: (count(*)), videoid
     ->  Sort  (cost=389625.50..389692.68 rows=26873 width=4) (actual time=11766.689..11767.818 rows=1800 loops=1)
           Output: (count(*)), videoid
           Sort Key: (count(*))
           Sort Method: top-N heapsort  Memory: 181kB
           ->  HashAggregate  (cost=387769.41..388038.14 rows=26873 width=4) (actual time=9215.653..10641.993 rows=1632578 loops=1)
                 Output: count(*), videoid
                 ->  Bitmap Heap Scan on public.term_search  (cost=44915.83..378163.38 rows=1921207 width=4) (actual time=312.449..7026.036 rows=2047691 loops=1)
                       Output: id, videoid, word, termindex, weight
                       Recheck Cond: (((term_search.word)::text = 'tester'::text) OR ((term_search.word)::text = 'question'::text) OR ((term_search.word)::text = 'one'::text))
                       Rows Removed by Index Recheck: 25512434
                       ->  BitmapOr  (cost=44915.83..44915.83 rows=1950031 width=0) (actual time=288.937..288.937 rows=0 loops=1)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..8552.83 rows=383502 width=0) (actual time=89.266..89.266 rows=419750 loops=1)
                                   Index Cond: ((term_search.word)::text = 'tester'::text)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..13171.84 rows=590836 width=0) (actual time=89.700..89.700 rows=604348 loops=1)
                                   Index Cond: ((term_search.word)::text = 'question'::text)
                             ->  Bitmap Index Scan on terms_word_idx  (cost=0.00..21750.26 rows=975693 width=0) (actual time=109.964..109.964 rows=1023593 loops=1)
                                   Index Cond: ((term_search.word)::text = 'one'::text)

The schema of the table is as follows:

    Column   |          Type          |                        Modifiers                         | Storage  | Description 
  -----------+------------------------+----------------------------------------------------------+----------+-------------
   id        | integer                | not null default nextval('term_search_id_seq'::regclass) | plain    | 
   videoid   | integer                |                                                          | plain    | 
   word      | character varying(100) |                                                          | extended | 
   termindex | character varying(15)  |                                                          | extended | 
   weight    | smallint               |                                                          | plain    | 
  Indexes:
      "term_search_pkey" PRIMARY KEY, btree (id)
      "search_term_exists_idx" btree (videoid, word)
      "terms_caverphone_idx" btree (termindex)
      "terms_video_idx" btree (videoid)
      "terms_word_idx" btree (word, videoid)
  Foreign-key constraints:
      "term_search_videoid_fkey" FOREIGN KEY (videoid) REFERENCES videos(id) ON DELETE CASCADE
  Has OIDs: no

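I have managed to get it down to 7 seconds with index-only scans, but that is still not fast enough. I am running PostgreSQL 9.3 on Ubuntu 14.04 on an AWS r3.xlarge instance, with approximately 50 million rows in the table. Any advice is greatly appreciated!

EDIT:

Attached is the result of SELECT schemaname,tablename,attname,null_frac,avg_width,n_distinct FROM pg_stats WHERE schemaname='public' and tablename='term_search';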

 schemaname |  tablename  |  attname  | null_frac | avg_width | n_distinct 
 ------------+-------------+-----------+-----------+-----------+------------
 public     | term_search | id        |         0 |         4 |         -1
 public     | term_search | videoid   |         0 |         4 |     568632
 public     | term_search | word      |         0 |         6 |       5054
 public     | term_search | termindex |         0 |        11 |       2485
 public     | term_search | weight    |         0 |         2 |          3

score 1 · Accepted Answer

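If I had the chance to take users offline for a night, I would:

- create a new table words from the words in term_search,
- create a reference from term_search to the new table,
- drop the column word.

Something like this: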

create table words (
    word_id serial primary key,
    word text);

insert into words (word)
    select distinct word
    from term_search;

alter table term_search add column word_id integer;

update term_search t
    set word_id = w.word_id
    from words w
    where t.word = w.word;

alter table term_search add constraint term_search_word_fkey 
    foreign key (word_id) references words (word_id);
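
The new word_id column has no index yet, so the test query below would probably fall back to scanning all of term_search. A minimal sketch of the supporting indexes, assuming the tables above; the index names are made up for illustration and are not part of the original answer:

```sql
-- Hypothetical index names; adjust to your own naming scheme.
-- Lets the planner look up words by their text (the words are distinct by construction).
CREATE UNIQUE INDEX words_word_idx ON words (word);

-- Mirrors the existing terms_word_idx (word, videoid), but on the narrower integer key.
CREATE INDEX term_search_word_id_videoid_idx ON term_search (word_id, videoid);

-- Refresh planner statistics after the bulk update.
ANALYZE words;
ANALYZE term_search;
```

Whether this beats the existing index on (word, videoid) depends on the data; the gain, if any, would come from the smaller integer key.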

Test:

SELECT count(*), videoid 
FROM term_search t
JOIN words w on t.word_id = w.word_id
WHERE w.word = 'tester' 
   OR w.word = 'question' 
   OR w.word = 'one' 
GROUP BY videoid 
ORDER BY count(*) desc 
LIMIT 1800;    

-- if it is faster, then
    alter table term_search drop column word;
-- and while we are at it...
    alter table term_search alter termindex type text;

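After this overhaul I would have to take care of inserts and updates on term_search. I would probably create a view with rules for insert and update.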
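
As a rough sketch of what such a view could look like, assuming the word column has already been dropped: the view name term_search_view and the rule name are hypothetical, only the insert path is shown, and the rule is not safe against two sessions inserting the same new word concurrently.

```sql
-- Hypothetical view that exposes the old column layout to existing code.
CREATE VIEW term_search_view AS
SELECT t.id, t.videoid, w.word, t.termindex, t.weight
FROM term_search t
JOIN words w ON w.word_id = t.word_id;

-- Redirect inserts on the view to the two underlying tables.
CREATE RULE term_search_view_insert AS
ON INSERT TO term_search_view
DO INSTEAD (
    -- Register the word if it is not known yet.
    INSERT INTO words (word)
    SELECT NEW.word
    WHERE NOT EXISTS (SELECT 1 FROM words WHERE word = NEW.word);

    -- Store the term row with the word's surrogate key.
    INSERT INTO term_search (videoid, word_id, termindex, weight)
    SELECT NEW.videoid, w.word_id, NEW.termindex, NEW.weight
    FROM words w
    WHERE w.word = NEW.word
);
```

Existing code could then run something like INSERT INTO term_search_view (videoid, word, termindex, weight) VALUES (1, 'tester', 'TSTR', 1). An INSTEAD OF trigger on the view would be an alternative to rules.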

score 1 · Accepted Answer

Let's first restate the query to make clear what it is really trying to do.

The query:

  SELECT count(*), videoid 
  FROM term_search 
  where word = 'tester' 
     OR word = 'question' 
     OR word = 'one' 
  group by videoid 
  order by count(*) desc 
  limit 1800;

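seems to mean:

"In the search-term table, find me videos with the search terms tester, question or one. Count the matches for each video and return the 1800 videos with the most matches."

or, more generally:

"Find the videos that best match my search terms and show me the top n best matches."

Right?

If so, why aren't you using PostgreSQL's built-in full-text search and full-text indexes? An indexed tsquery match against a tsvector for each video is likely to be a win here. Full-text search has fuzzy matching, ranking, and pretty much everything else you would want. Unlike your current approach, it does not need to materialize and sort the whole data set only to discard most of it.

You haven't provided sample data, so I can't really do a demo.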
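
Without real data, the following is only a rough sketch of the full-text approach. The table video_terms, its column terms, and the index name are assumptions made up for illustration; the 'simple' configuration is chosen so that words are stored as-is, without stemming or stop-word removal.

```sql
-- Hypothetical table: one row per video, with all of its terms in one tsvector.
CREATE TABLE video_terms AS
SELECT videoid,
       to_tsvector('simple', string_agg(word, ' ')) AS terms
FROM term_search
GROUP BY videoid;

-- GIN index so that the @@ match below can use an index.
CREATE INDEX video_terms_gin_idx ON video_terms USING gin (terms);

-- Videos matching any of the three terms, best-ranked first.
SELECT videoid,
       ts_rank(terms, q) AS rank
FROM video_terms,
     to_tsquery('simple', 'tester | question | one') AS q
WHERE terms @@ q
ORDER BY rank DESC
LIMIT 1800;
```

Note that ts_rank is not the same measure as the count(*) of matching rows in the original query, so the ordering may differ. The table would also need to be kept in sync with term_search.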

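How PostgreSQL currently executes your query can be explained like this:

- Create a bitmap with one bit for each disk page (8 kB) of the table, where true means the page may contain one or more matching rows.
- For each search term, scan the index terms_word_idx and update the bitmap, setting the bit wherever a match is found.
- Scan the table, skipping the pages that the bitmap says cannot contain a match, looking for rows that contain any of the words. This is like a fast seqscan that skips the gaps. If the percentage of matches is high, it is not actually much faster than a plain seqscan.
- For each matching row, sort it into a series of "buckets" by video ID. Then, at the end, count how many rows are in each bucket and return the count plus the video ID of that bucket. (It is not quite that simple, but close enough.)
- As each bucket is counted, place the result between the results with the next-highest and next-lowest counts.
- Take the top 1800 results and throw away all your hard work.

That doesn't sound like much fun, but it has no choice. A b-tree index cannot be descended to search for multiple terms at once, so it has to do multiple index scans. The rest follows from that.

So: to make this more efficient, you need to fundamentally change how you tackle the problem. Adding an index or tuning a few parameters is not going to suddenly make this take 0.5 seconds.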

score 0 · Accepted Answer

You can tune the PostgreSQL settings to reduce query execution time. For example, you can use the pgtune utility:

apt-get install pgtune
cd /etc/postgresql/*.*/main/
cp postgresql.conf postgresql.conf.default
pgtune -i postgresql.conf.default -o postgresql.conf --type=%TYPE%

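Here %TYPE% is one of the following values:

DW for large data volumes, big queries, and infrequent calls
WEB for web applications; best suited to Django and other web applications

You can find more information about pgtune on Google and in its help output.

For PostgreSQL < 9.3, you must use this script: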

#!/bin/bash
# simple shmsetup script
page_size=`getconf PAGE_SIZE`
phys_pages=`getconf _PHYS_PAGES`
shmall=`expr $phys_pages / 2`
shmmax=`expr $shmall \* $page_size`
echo kernel.shmmax = $shmmax
echo kernel.shmall = $shmall

Write the result to the file /etc/sysctl.conf and reboot the system. Otherwise Postgres will not start.

score 0 · Accepted Answer

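Others have already made suggestions on how to restructure the database, but you can probably make the query run better as it stands. The following line in the EXPLAIN output shows that your bitmap is overflowing: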

Rows Removed by Index Recheck: 25512434

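If the recheck is what consumes the time (rather than IO; running EXPLAIN (ANALYZE, BUFFERS) would help clarify this, especially if you have track_io_timing turned on), then increasing work_mem could help a lot, assuming you can afford to do so without running out of RAM.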
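
A minimal sketch of that experiment in a single session. The 256MB figure is only an illustrative assumption, not a recommendation, and track_io_timing can only be changed by a superuser:

```sql
-- Report time spent on reads and writes in the plan output (superuser only).
SET track_io_timing = on;

-- Session-local increase; illustrative value only. It applies to each sort or
-- hash operation separately, so a global increase multiplies across connections.
SET work_mem = '256MB';

EXPLAIN (ANALYZE, BUFFERS)
SELECT count(*), videoid
FROM term_search
WHERE word = 'tester'
   OR word = 'question'
   OR word = 'one'
GROUP BY videoid
ORDER BY count(*) DESC
LIMIT 1800;

-- Return to the configured default afterwards.
RESET work_mem;
```

If the bitmap now fits in memory, the "Rows Removed by Index Recheck" figure should drop sharply or disappear from the plan.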
