postgresql - Making a Postgres query faster. More indexes?

Question

I'm running Geodjango/Postgres 9.1/PostGIS and I'm trying to get the following query (and others like it) to run faster.

[Query is truncated for brevity]

SELECT "crowdbreaks_incomingkeyword"."keyword_id"
       , COUNT("crowdbreaks_incomingkeyword"."keyword_id") AS "cnt" 
  FROM "crowdbreaks_incomingkeyword"
 INNER JOIN "crowdbreaks_tweet"
       ON ("crowdbreaks_incomingkeyword"."tweet_id"
          = "crowdbreaks_tweet"."tweet_id")
  LEFT OUTER JOIN "crowdbreaks_place"
    ON ("crowdbreaks_tweet"."place_id"
       = "crowdbreaks_place"."place_id") 
 WHERE (("crowdbreaks_tweet"."coordinates"
        @ ST_GeomFromEWKB(E'\\001 ... \\000\\000\\000\\0008@'::bytea)
       OR ST_Overlaps("crowdbreaks_place"."bounding_box"
                     , ST_GeomFromEWKB(E'\\001...00\\000\\0008@'::bytea)
       )) 
   AND "crowdbreaks_tweet"."created_at" > E'2012-04-17 15:46:12.109893'
   AND "crowdbreaks_tweet"."created_at" < E'2012-04-18 15:46:12.109899' ) 
 GROUP BY "crowdbreaks_incomingkeyword"."keyword_id"
         , "crowdbreaks_incomingkeyword"."keyword_id"
    ;

Here's what the crowdbreaks_tweet table looks like:

\d+ crowdbreaks_tweet;
                       Table "public.crowdbreaks_tweet"
    Column     |           Type           | Modifiers | Storage  | Description 
---------------+--------------------------+-----------+----------+-------------
 tweet_id      | bigint                   | not null  | plain    | 
 tweeter       | bigint                   | not null  | plain    | 
 text          | text                     | not null  | extended | 
 created_at    | timestamp with time zone | not null  | plain    | 
 country_code  | character varying(3)     |           | extended | 
 place_id      | character varying(32)    |           | extended | 
 coordinates   | geometry                 |           | main     | 
Indexes:
    "crowdbreaks_tweet_pkey" PRIMARY KEY, btree (tweet_id)
    "crowdbreaks_tweet_coordinates_id" gist (coordinates)
    "crowdbreaks_tweet_created_at" btree (created_at)
    "crowdbreaks_tweet_place_id" btree (place_id)
    "crowdbreaks_tweet_place_id_like" btree (place_id varchar_pattern_ops)
Check constraints:
    "enforce_dims_coordinates" CHECK (st_ndims(coordinates) = 2)
    "enforce_geotype_coordinates" CHECK (geometrytype(coordinates) = 'POINT'::text OR coordinates IS NULL)
    "enforce_srid_coordinates" CHECK (st_srid(coordinates) = 4326)
Foreign-key constraints:
    "crowdbreaks_tweet_place_id_fkey" FOREIGN KEY (place_id) REFERENCES crowdbreaks_place(place_id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "crowdbreaks_incomingkeyword" CONSTRAINT "crowdbreaks_incomingkeyword_tweet_id_fkey" FOREIGN KEY (tweet_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
    TABLE "crowdbreaks_tweetanswer" CONSTRAINT "crowdbreaks_tweetanswer_tweet_id_id_fkey" FOREIGN KEY (tweet_id_id) REFERENCES crowdbreaks_tweet(tweet_id) DEFERRABLE INITIALLY DEFERRED
Has OIDs: no

And here's the EXPLAIN ANALYZE output for the query:

 HashAggregate  (cost=184022.03..184023.18 rows=115 width=4) (actual time=6381.707..6381.769 rows=62 loops=1)
   ->  Hash Join  (cost=103857.48..183600.24 rows=84357 width=4) (actual time=1745.449..6377.505 rows=3453 loops=1)
         Hash Cond: (crowdbreaks_incomingkeyword.tweet_id = crowdbreaks_tweet.tweet_id)
         ->  Seq Scan on crowdbreaks_incomingkeyword  (cost=0.00..36873.97 rows=2252597 width=12) (actual time=0.008..2136.839 rows=2252597 loops=1)
         ->  Hash  (cost=102535.68..102535.68 rows=80544 width=8) (actual time=1744.815..1744.815 rows=3091 loops=1)
               Buckets: 4096  Batches: 4  Memory Usage: 32kB
               ->  Hash Left Join  (cost=16574.93..102535.68 rows=80544 width=8) (actual time=112.551..1740.651 rows=3091 loops=1)
                     Hash Cond: ((crowdbreaks_tweet.place_id)::text = (crowdbreaks_place.place_id)::text)
                     Filter: ((crowdbreaks_tweet.coordinates @ '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) OR ((crowdbreaks_place.bounding_box && '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry) AND _st_overlaps(crowdbreaks_place.bounding_box, '0103000020E61000000100000005000000AE47E17A141E5FC00000000000003840AE47E17A141E5FC029ED0DBE30B14840A4703D0AD7A350C029ED0DBE30B14840A4703D0AD7A350C00000000000003840AE47E17A141E5FC00000000000003840'::geometry)))
                     ->  Bitmap Heap Scan on crowdbreaks_tweet  (cost=15874.18..67060.28 rows=747873 width=125) (actual time=96.012..940.462 rows=736784 loops=1)
                           Recheck Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
                           ->  Bitmap Index Scan on crowdbreaks_tweet_created_at  (cost=0.00..15687.22 rows=747873 width=0) (actual time=94.259..94.259 rows=736784 loops=1)
                                 Index Cond: ((created_at > '2012-04-17 15:46:12.109893+00'::timestamp with time zone) AND (created_at < '2012-04-18 15:46:12.109899+00'::timestamp with time zone))
                     ->  Hash  (cost=217.11..217.11 rows=6611 width=469) (actual time=15.926..15.926 rows=6611 loops=1)
                           Buckets: 1024  Batches: 4  Memory Usage: 259kB
                           ->  Seq Scan on crowdbreaks_place  (cost=0.00..217.11 rows=6611 width=469) (actual time=0.005..6.908 rows=6611 loops=1)
 Total runtime: 6381.903 ms
(17 rows)

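That's a pretty terrible runtime for this query. Ideally, I'd like to get results back in a second or two.

I've increased shared_buffers on Postgres to 2GB (I have 8GB of RAM), but beyond that I'm not quite sure what to do. What are my options? Should I do fewer joins? Are there any other indexes I can add? The sequential scan on crowdbreaks_incomingkeyword doesn't make sense to me. It's a table of foreign keys to other tables, so it has indexes on it.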

score 5 · Accepted Answer

Judging from your comments, I would try two things:

Raise the statistics target for the columns involved (and run ANALYZE).

ALTER TABLE tbl ALTER COLUMN column SET STATISTICS 1000;

The data distribution may be uneven. A bigger sample can give the query planner more accurate estimates.
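Applied to the table from the question, that might look like the sketch below. The choice of columns is an assumption: created_at and place_id are picked only because they appear in the index condition and the join condition of the posted plan, not because they are known to be the source of the misestimate.

```sql
-- Assumption: these are the columns worth a larger sample; substitute
-- whichever columns the planner misestimates in your own plans.
ALTER TABLE crowdbreaks_tweet ALTER COLUMN created_at SET STATISTICS 1000;
ALTER TABLE crowdbreaks_tweet ALTER COLUMN place_id SET STATISTICS 1000;

-- The new target only takes effect once statistics are collected again.
ANALYZE crowdbreaks_tweet;
```

Re-running EXPLAIN ANALYZE afterwards shows whether the estimated row counts (for example rows=80544 against an actual 3091 on the Hash Left Join) move closer to the actual ones.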

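Play with the cost settings in postgresql.conf. Your sequential scans may need to be costed as more expensive relative to index scans for the planner to produce good estimates.

Try lowering cpu_index_tuple_cost, and set effective_cache_size to as much as three quarters of total RAM for a dedicated database server.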
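Both settings can be changed for a single session, which makes it possible to experiment before touching postgresql.conf. The values below are illustrative assumptions rather than recommendations: 6GB is three quarters of the 8GB mentioned in the question (and only appropriate if the machine is dedicated to the database), and 0.001 is simply a value below the default cpu_index_tuple_cost of 0.005.

```sql
-- Session-local overrides; nothing is written to postgresql.conf.
SET cpu_index_tuple_cost = 0.001;   -- assumed trial value, default is 0.005
SET effective_cache_size = '6GB';   -- assumed: 3/4 of 8GB on a dedicated server

-- Confirm what the session is now using.
SHOW cpu_index_tuple_cost;
SHOW effective_cache_size;

-- Re-run the slow query under EXPLAIN ANALYZE here and compare the plan
-- and runtime with the original, then return to the configured values.
RESET cpu_index_tuple_cost;
RESET effective_cache_size;
```

If a combination produces a better plan, the same values can then be made permanent in postgresql.conf followed by a configuration reload.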
