
I am currently working on a data-tracking system. It is a multi-process application written in Python that works as follows:

  1. Every S seconds, it selects the N most suitable tasks from the database (currently Postgres) and looks up data for them.
  2. If there are no tasks, it creates N new tasks and returns to (1).
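The worker loop above can be sketched roughly as follows. This is a minimal sketch only: `fetch_tasks`, `create_tasks`, and the in-memory `task_store` are placeholders standing in for the real Postgres queries, and the constants mirror the `LIMIT 36` in the query below.

```python
S = 5   # polling interval in seconds (placeholder value)
N = 36  # batch size, matching the LIMIT in the real query

# Stand-in for the task table; each task is a dict with a status field.
task_store = []

def fetch_tasks(n):
    """Select up to n pending tasks (stands in for the SELECT ... LIMIT n)."""
    return [t for t in task_store if t["stat"] == ""][:n]

def create_tasks(n):
    """Create n fresh tasks (stands in for the task-generation step)."""
    new = [{"taskid": f"t{len(task_store) + i}", "stat": ""} for i in range(n)]
    task_store.extend(new)
    return new

def poll_once():
    tasks = fetch_tasks(N)
    if not tasks:          # step 2: nothing pending -> create N new tasks
        tasks = create_tasks(N)
    for t in tasks:        # step 1: look up data for each selected task
        t["stat"] = "CR1"  # mark as in progress
    return tasks
```

In the real system each iteration would sleep S seconds between polls (`time.sleep(S)`) instead of running back-to-back.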

The problem: I currently have roughly 80 GB of data and 36M tasks, and queries against the task table (the largest and most heavily used table) keep getting slower.

The main performance bottleneck is the task-tracking query:

LOCK TABLE task IN ACCESS EXCLUSIVE MODE;

SELECT *
FROM task
WHERE line = 1
  AND action = ANY(ARRAY['Find', 'Get'])
  AND (stat IN ('', 'CR1')
       OR stat = 'ERROR' AND (actiondate <= NOW() OR actiondate IS NULL))
ORDER BY taskid, actiondate, action DESC, idtype, date ASC
LIMIT 36;

                                    Table "public.task"
   Column   |            Type             |                    Modifiers
------------+-----------------------------+-------------------------------------------------
 number     | character varying(16)       | not null
 date       | timestamp without time zone | default now()
 stat       | character varying(16)       | not null default ''::character varying
 idtype     | character varying(16)       | not null default 'container'::character varying
 uri        | character varying(1024)     |
 action     | character varying(16)       | not null default 'Find'::character varying
 reason     | character varying(4096)     | not null default ''::character varying
 rev        | integer                     | not null default 0
 actiondate | timestamp without time zone |
 modifydate | timestamp without time zone |
 line       | integer                     |
 datasource | character varying(512)      |
 taskid     | character varying(32)       |
 found      | integer                     | not null default 0
Indexes:
    "task_pkey" PRIMARY KEY, btree (idtype, number)
    "action_index" btree (action)
    "actiondate_index" btree (actiondate)
    "date_index" btree (date)
    "line_index" btree (line)
    "modifydate_index" btree (modifydate)
    "stat_index" btree (stat)
    "taskid_index" btree (taskid)

                               QUERY PLAN                          
---------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=312638.87..312638.96 rows=36 width=668) (actual time=1838.193..1838.197 rows=36 loops=1)
   ->  Sort  (cost=312638.87..313149.54 rows=204267 width=668) (actual time=1838.192..1838.194 rows=36 loops=1)
         Sort Key: taskid, actiondate, action, idtype, date
         Sort Method: top-N heapsort  Memory: 43kB
         ->  Bitmap Heap Scan on task  (cost=107497.61..306337.31 rows=204267 width=668) (actual time=1013.491..1343.751 rows=914586 loops=1)
               Recheck Cond: ((((stat)::text = ANY ('{"",CR1}'::text[])) OR ((stat)::text = 'ERROR'::text)) AND (line = 1))
               Filter: (((action)::text = ANY ('{Find,Get}'::text[])) AND (((stat)::text = ANY ('{"",CR1}'::text[])) OR (((stat)::text = 'ERROR'::text) AND ((actiondate <= now()) OR (actiondate IS NULL)))))
               Rows Removed by Filter: 133
               Heap Blocks: exact=76064
               ->  BitmapAnd  (cost=107497.61..107497.61 rows=237348 width=0) (actual time=999.457..999.457 rows=0 loops=1)
                     ->  BitmapOr  (cost=9949.15..9949.15 rows=964044 width=0) (actual time=121.936..121.936 rows=0 loops=1)
                           ->  Bitmap Index Scan on stat_index  (cost=0.00..9449.46 rows=925379 width=0) (actual time=117.791..117.791 rows=920900 loops=1)
                                 Index Cond: ((stat)::text = ANY ('{"",CR1}'::text[]))
                           ->  Bitmap Index Scan on stat_index  (cost=0.00..397.55 rows=38665 width=0) (actual time=4.144..4.144 rows=30262 loops=1)
                                 Index Cond: ((stat)::text = 'ERROR'::text)
                     ->  Bitmap Index Scan on line_index  (cost=0.00..97497.14 rows=9519277 width=0) (actual time=853.033..853.033 rows=9605462 loops=1)
                           Index Cond: (line = 1)
 Planning time: 0.284 ms
 Execution time: 1838.882 ms
(19 rows)

Naturally, all the fields involved are indexed. I am currently considering two directions:

  1. How to optimize the query, and whether doing so will actually give me a meaningful performance improvement (right now each query takes about 10 seconds, which is unacceptable for dynamic task tracking).
  2. Where and how to store the task data more efficiently: perhaps I should use another database for this purpose, such as Cassandra, VoltDB, or some other big-data store?

I think the data should somehow be pre-ordered so that the currently actionable tasks can be fetched as quickly as possible.

Also keep in mind that my current 80 GB is most likely the minimum volume for this kind of workload, not the maximum.

Thanks in advance!


1 Answer


I don't know your use case very well, but it looks to me like your indexes aren't serving you well. The query seems to rely mostly on the stat index. I think you need to look into a composite index, something like (action, line, stat).
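For example (a sketch only; the index names are illustrative, and the column order and partial predicate would need to be validated with EXPLAIN ANALYZE against the real workload):

```sql
-- Composite index putting the equality-filtered columns first.
CREATE INDEX task_line_action_stat_idx ON task (line, action, stat);

-- Alternatively, a partial index restricted to the rows the tracker
-- actually polls, so the index stays small as the table grows.
CREATE INDEX task_pending_idx ON task (taskid, actiondate)
    WHERE line = 1 AND stat IN ('', 'CR1', 'ERROR');
```

Note that Postgres only considers a partial index when the query's WHERE clause provably implies the index predicate, and the mixed ASC/DESC ORDER BY in the query means no single default-ordered index can satisfy the sort on its own.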

Another option is to shard the data across multiple tables, splitting it on some low-cardinality key. I don't use Postgres myself, but I don't think moving to another database will serve you better unless you know exactly what you are optimizing for.
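In Postgres of that era (before declarative partitioning), splitting on a low-cardinality key such as `line` would typically be done with table inheritance plus CHECK constraints. A hedged sketch, with illustrative child-table names:

```sql
-- Parent table 'task' keeps the schema; children hold the rows.
CREATE TABLE task_line_1 (CHECK (line = 1)) INHERITS (task);
CREATE TABLE task_line_2 (CHECK (line = 2)) INHERITS (task);

-- With constraint exclusion enabled, a query filtering on line = 1
-- skips the child tables whose CHECK constraint rules them out.
SET constraint_exclusion = partition;
SELECT * FROM task WHERE line = 1 AND stat = '' LIMIT 36;
```

Rows must be routed to the right child on INSERT (via a trigger or by inserting into the child directly), which is part of the maintenance cost of this approach.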

answered 2016-06-11T02:28:02.550