
I am using Postgres 9.2.4.

We have a background job that imports users' emails into our system and stores them in a Postgres database table.

Here is the table:

CREATE TABLE emails
(
  id serial NOT NULL,
  subject text,
  body text,
  personal boolean,
  sent_at timestamp without time zone NOT NULL,
  created_at timestamp without time zone,
  updated_at timestamp without time zone,
  account_id integer NOT NULL,
  sender_user_id integer,
  sender_contact_id integer,
  html text,
  folder text,
  draft boolean DEFAULT false,
  check_for_response timestamp without time zone,
  send_time timestamp without time zone,
  CONSTRAINT emails_pkey PRIMARY KEY (id),
  CONSTRAINT emails_account_id_fkey FOREIGN KEY (account_id)
      REFERENCES accounts (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE CASCADE,
  CONSTRAINT emails_sender_contact_id_fkey FOREIGN KEY (sender_contact_id)
      REFERENCES contacts (id) MATCH SIMPLE
      ON UPDATE NO ACTION ON DELETE CASCADE
)
WITH (
  OIDS=FALSE
);
ALTER TABLE emails
  OWNER TO paulcowan;

-- Index: emails_account_id_index

-- DROP INDEX emails_account_id_index;

CREATE INDEX emails_account_id_index
  ON emails
  USING btree
  (account_id);

-- Index: emails_sender_contact_id_index

-- DROP INDEX emails_sender_contact_id_index;

CREATE INDEX emails_sender_contact_id_index
  ON emails
  USING btree
  (sender_contact_id);

-- Index: emails_sender_user_id_index

-- DROP INDEX emails_sender_user_id_index;

CREATE INDEX emails_sender_user_id_index
  ON emails
  USING btree
  (sender_user_id);

The query is made more complex by the fact that I have a view on this table that pulls in additional data:

CREATE OR REPLACE VIEW email_graphs AS 
 SELECT emails.id, emails.subject, emails.body, emails.folder, emails.html, 
    emails.personal, emails.draft, emails.created_at, emails.updated_at, 
    emails.sent_at, emails.sender_contact_id, emails.sender_user_id, 
    emails.addresses, emails.read_by, emails.check_for_response, 
    emails.send_time, ts.ids AS todo_ids, cs.ids AS call_ids, 
    ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people, 
    atts.ids AS attachment_ids
   FROM emails
   LEFT JOIN ( SELECT todos.reference_email_id AS email_id, 
            array_to_json(array_agg(todos.id)) AS ids
           FROM todos
          GROUP BY todos.reference_email_id) ts ON ts.email_id = emails.id
   LEFT JOIN ( SELECT calls.reference_email_id AS email_id, 
       array_to_json(array_agg(calls.id)) AS ids
      FROM calls
     GROUP BY calls.reference_email_id) cs ON cs.email_id = emails.id
   LEFT JOIN ( SELECT deals.reference_email_id AS email_id, 
    array_to_json(array_agg(deals.id)) AS ids
   FROM deals
  GROUP BY deals.reference_email_id) ds ON ds.email_id = emails.id
   LEFT JOIN ( SELECT meetings.reference_email_id AS email_id, 
    array_to_json(array_agg(meetings.id)) AS ids
   FROM meetings
  GROUP BY meetings.reference_email_id) ms ON ms.email_id = emails.id
   LEFT JOIN ( SELECT comments.email_id, 
    array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
           FROM ( VALUES (comments.id,comments.text,comments.author_id,comments.created_at,comments.updated_at)) r(id, text, author_id, created_at, updated_at)))) AS comments
   FROM comments
  WHERE comments.email_id IS NOT NULL
  GROUP BY comments.email_id) c ON c.email_id = emails.id
   LEFT JOIN ( SELECT email_participants.email_id, 
    array_to_json(array_agg(( SELECT row_to_json(r.*) AS row_to_json
           FROM ( VALUES (email_participants.user_id,email_participants.contact_id,email_participants.kind)) r(user_id, contact_id, kind)))) AS people
   FROM email_participants
  GROUP BY email_participants.email_id) p ON p.email_id = emails.id
   LEFT JOIN ( SELECT attachments.reference_email_id AS email_id, 
    array_to_json(array_agg(attachments.id)) AS ids
   FROM attachments
  GROUP BY attachments.reference_email_id) atts ON atts.email_id = emails.id;

ALTER TABLE email_graphs
  OWNER TO paulcowan;

We then run paginated queries against this view, for example:

SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0

Queries against this table have slowed down dramatically as it has grown.

If I run the paginated query with EXPLAIN ANALYZE

EXPLAIN ANALYZE SELECT "email_graphs".* FROM "email_graphs" INNER JOIN "email_participants" ON ("email_participants"."email_id" = "email_graphs"."id") WHERE (("user_id" = 75) AND ("folder" = 'INBOX')) ORDER BY "sent_at" DESC LIMIT 5 OFFSET 0;

I get this result:

                                                           ->  Seq Scan on deals  (cost=0.00..9.11 rows=36 width=8) (actual time=0.003..0.044 rows=34 loops=1)
                                   ->  Sort  (cost=5.36..5.43 rows=131 width=36) (actual time=0.416..0.416 rows=1 loops=1)
                                         Sort Key: ms.email_id
                                         Sort Method: quicksort  Memory: 26kB
                                         ->  Subquery Scan on ms  (cost=3.52..4.44 rows=131 width=36) (actual time=0.408..0.411 rows=1 loops=1)
                                               ->  HashAggregate  (cost=3.52..4.05 rows=131 width=8) (actual time=0.406..0.408 rows=1 loops=1)
                                                     ->  Seq Scan on meetings  (cost=0.00..3.39 rows=131 width=8) (actual time=0.006..0.163 rows=161 loops=1)
                             ->  Sort  (cost=18.81..18.91 rows=199 width=36) (actual time=0.012..0.012 rows=0 loops=1)
                                   Sort Key: c.email_id
                                   Sort Method: quicksort  Memory: 25kB
                                   ->  Subquery Scan on c  (cost=15.90..17.29 rows=199 width=36) (actual time=0.007..0.007 rows=0 loops=1)
                                         ->  HashAggregate  (cost=15.90..16.70 rows=199 width=60) (actual time=0.006..0.006 rows=0 loops=1)
                                               ->  Seq Scan on comments  (cost=0.00..12.22 rows=736 width=60) (actual time=0.004..0.004 rows=0 loops=1)
                                                     Filter: (email_id IS NOT NULL)
                                                     Rows Removed by Filter: 2
                                               SubPlan 1
                                                 ->  Values Scan on "*VALUES*"  (cost=0.00..0.00 rows=1 width=56) (never executed)
                       ->  Materialize  (cost=4220.14..4883.55 rows=27275 width=36) (actual time=247.720..1189.545 rows=29516 loops=1)
                             ->  GroupAggregate  (cost=4220.14..4788.09 rows=27275 width=15) (actual time=247.715..1131.787 rows=29516 loops=1)
                                   ->  Sort  (cost=4220.14..4261.86 rows=83426 width=15) (actual time=247.634..339.376 rows=82632 loops=1)
                                         Sort Key: public.email_participants.email_id
                                         Sort Method: external sort  Disk: 1760kB
                                         ->  Seq Scan on email_participants  (cost=0.00..2856.28 rows=83426 width=15) (actual time=0.009..88.938 rows=82720 loops=1)
                                   SubPlan 2
                                     ->  Values Scan on "*VALUES*"  (cost=0.00..0.00 rows=1 width=40) (actual time=0.004..0.005 rows=1 loops=82631)
                 ->  Sort  (cost=2.01..2.01 rows=1 width=36) (actual time=0.074..0.077 rows=3 loops=1)
                       Sort Key: atts.email_id
                       Sort Method: quicksort  Memory: 25kB
                       ->  Subquery Scan on atts  (cost=2.00..2.01 rows=1 width=36) (actual time=0.048..0.060 rows=3 loops=1)
                             ->  HashAggregate  (cost=2.00..2.01 rows=1 width=8) (actual time=0.045..0.051 rows=3 loops=1)
                                   ->  Seq Scan on attachments  (cost=0.00..2.00 rows=1 width=8) (actual time=0.013..0.021 rows=5 loops=1)
           ->  Index Only Scan using email_participants_email_id_user_id_index on email_participants  (cost=0.00..990.04 rows=269 width=4) (actual time=1.357..2.886 rows=43 loops=1)
                 Index Cond: (user_id = 75)
                 Heap Fetches: 43

Total runtime: 1642.157 ms (75 rows)


2 Answers


I am definitely not looking for a fix :) or a rewritten query. Any sort of high-level advice would be most welcome.

Per my comment, the gist of the problem lies in the aggregates that get joined with one another. This prevents the use of indexes and yields a bunch of merge joins (and materializes) in your query plan…

Put another way, think of it as a plan so convoluted that Postgres proceeds by materializing temporary tables in memory and then repeatedly sorting them until they are all merge-joined as appropriate. From where I stand, the whole thing seems to amount to selecting every row from every table, along with every possible relationship between them. Once that has been computed and conjured up, Postgres goes on to sort the mess in order to extract the top-n rows.

At any rate, you want to rewrite the query so that it can actually start using indexes.

Part of it is straightforward. This, for instance, is a big no-no:

select …,
       ts.ids AS todo_ids, cs.ids AS call_ids, 
       ds.ids AS deal_ids, ms.ids AS meeting_ids, c.comments, p.people, 
       atts.ids AS attachment_ids

Fetch the emails in one query. Fetch the related objects in separate queries with a bite-sized email_id IN (…) clause. Doing that alone should speed things up considerably.
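
A rough sketch of that approach, adapted from your paginated query above (the IN list values are placeholders standing in for whatever ids the first query returns, not real data):

-- 1) Paginate against the base table only, skipping the aggregating view:
SELECT e.*
FROM emails e
INNER JOIN email_participants ep ON ep.email_id = e.id
WHERE ep.user_id = 75
  AND e.folder = 'INBOX'
ORDER BY e.sent_at DESC
LIMIT 5 OFFSET 0;

-- 2) Then fetch the related objects for just that handful of ids,
--    e.g. the todo ids (repeat the pattern for calls, deals, meetings, attachments):
SELECT reference_email_id AS email_id,
       array_to_json(array_agg(id)) AS todo_ids
FROM todos
WHERE reference_email_id IN (101, 102, 103, 104, 105)
GROUP BY reference_email_id;

That way each aggregate only has to run over the five ids on the current page instead of over every row of every related table.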

As for the rest, it may or may not be simple, or it may involve some re-engineering of your schema. I only skimmed through the incomprehensible monster and its gruesome query plan, so I cannot say for certain.

Answered 2013-11-09T12:25:33.323

I think it is unlikely that the big view will ever perform well, and you should break it into more manageable components, but two specific suggestions still come to mind:

Schema change

Move the text and html bodies out of the main table. Although large content is automatically stored in TOAST space, mail parts will often be smaller than the TOAST threshold (~2000 bytes), especially for plain text, so this will not happen systematically.

If you consider that the primary purpose of the table is to hold the header fields (sender, recipients, date, subject, and so on), each non-TOASTed body bloats the table in a way that is detrimental to I/O performance and caching.
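
A minimal sketch of such a split, assuming a new email_bodies side table (the table and column layout here are illustrative, not something from your existing schema):

-- Side table holding the bulky content, one row per email:
CREATE TABLE email_bodies (
  email_id integer PRIMARY KEY REFERENCES emails (id) ON DELETE CASCADE,
  body     text,
  html     text
);

-- Copy the existing content across:
INSERT INTO email_bodies (email_id, body, html)
SELECT id, body, html FROM emails;

-- Once the application reads bodies from the new table:
-- ALTER TABLE emails DROP COLUMN body, DROP COLUMN html;
-- (followed by VACUUM FULL or a table rewrite to actually reclaim the space)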

I can test this with contents I happen to have in a mail database. On a sample of 55k messages from my inbox:

average text/plain size: 1511 bytes
average text/html size: 11895 bytes (but 42395 messages have no html at all)

size of the mails table without the bodies: 14Mb (no TOAST)
with the bodies added as 2 more TEXT columns, as you have: 59Mb in main storage, 61Mb in TOAST

Despite TOAST, the main storage turns out to be about 4 times larger. So when scanning the table without needing the TEXT columns, 80% of the I/O is wasted. Future row updates are likely to make this worse through fragmentation.

The effect in terms of block reads can be spotted through the pg_statio_all_tables view (compare heap_blks_read + heap_blks_hit before and after a query).
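
For example, a quick before/after check (here 'emails' is just your table from above; substitute whichever relation you are measuring):

SELECT heap_blks_read, heap_blks_hit
FROM pg_statio_all_tables
WHERE relname = 'emails';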

Tuning

This part of the EXPLAIN output:

Sort Method: external sort  Disk: 1760kB

hints that your work_mem is too small. You don't want to hit the disk for sorts that small. Make it at least 10MB unless your available memory is really low. While you're at it, set shared_buffers to a reasonable value if it's still at the default. See http://wiki.postgresql.org/wiki/Performance_Optimization for more.
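
One way to try this out, as a sketch (the values are only starting points; on 9.2 there is no ALTER SYSTEM, so permanent changes go into postgresql.conf followed by a reload or restart):

-- Session-level experiment, no restart needed:
SET work_mem = '10MB';
-- Re-run the EXPLAIN ANALYZE above and check that the Sort Method line
-- now reports an in-memory quicksort instead of "external sort  Disk: ...".

-- Permanent settings belong in postgresql.conf, e.g.:
--   work_mem = 10MB
--   shared_buffers = 2GB   -- often set to roughly 25% of RAM; changing it needs a restart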

Answered 2013-11-12T00:16:47.993