
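I have a Django application that connects to a PostgreSQL 9.1.9 database running on its own dedicated machine with 2 GB of RAM. The database stores a cache of Twitter tweets (about 1 million) and indexes them by the words they contain. Here are the relevant models: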

class TwitterPassage(models.Model):
    third_party_id = models.CharField(max_length=STANDARD_MAX_LEN, db_index=True, unique=True)
    third_party_created = models.DateTimeField(null=True, db_index=True)
    source = models.CharField(max_length=STANDARD_MAX_LEN)
    text = models.CharField(max_length=STANDARD_MAX_LEN)
    author = models.CharField(max_length=STANDARD_MAX_LEN)
    words = models.ManyToManyField('connectr.Word')
    quality = models.BigIntegerField(null=True, blank=True, db_index=True)
    author_fk = models.ForeignKey('connectr.TwitterUser', null=True)

class Word(models.Model):
    word = models.CharField(max_length=STANDARD_MAX_LEN, db_index=True, unique=True)
    display_word = models.CharField(max_length=STANDARD_MAX_LEN, default='', blank=True)
    passage_count = models.IntegerField(null=True, db_index=True, blank=True)

class User(models.Model):
    user_id = models.CharField(max_length=STANDARD_MAX_LEN, db_index=True)
    tweet_passages = models.ManyToManyField('connectr.TwitterPassage', through='connectr.PassageViewEvent')

A Word has a many-to-many relationship with every TwitterPassage that contains that word.

The query I'm running is:

# Exclude tweets this user has already seen, and find the 20 highest quality tweets they haven't yet seen
word.twitterpassage_set.exclude(user=current_user).order_by('-quality')[:20]

quality is an integer score ranging from roughly 0 to 300.

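Sometimes this query is as fast as I need it to be (under a second). Other times it is very slow, taking up to 10 seconds. The slowness seems most pronounced for really common words such as "their" or "my", and less so for rare words that are associated with fewer TwitterPassages.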

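I have 8 indexed fields on the TwitterPassage model and 5 on the Word model. Does this just mean I need more RAM, or fewer indexes? How would I determine which of these would fix the problem?

Also, in case it helps, here is some information about the size of the database: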

                            relation                                |  size   
------------------------------------------------------------------------+---------
 public.connectr_twitterpassage_words_word_id                           | 1680 MB
 public.connectr_twitterpassage_twitterpassage_id_613c80271f09fba8_uniq | 1199 MB
 public.connectr_twitterpassage_words_pkey                              | 1010 MB
 public.connectr_twitterpassage_words                                   | 1009 MB
 public.connectr_twitterpassage_words_twitterpassage_id                 | 1002 MB
 public.connectr_twitterpassage                                         | 620 MB
 public.connectr_twitteruser                                            | 449 MB
 public.connectr_twitterpassage_created                                 | 256 MB
 public.connectr_passage_source_like                                    | 230 MB
 public.connectr_passage_source                                         | 229 MB
 public.connectr_twitterpassage_is_top_tweet                            | 194 MB
 public.connectr_passage_pkey                                           | 187 MB
 public.connectr_word                                                   | 184 MB
 public.connectr_passage_third_party_id_like                            | 181 MB
 public.connectr_passage_third_party_id                                 | 180 MB
 public.connectr_passage_retweet_count                                  | 170 MB
 public.connectr_twitterpassage_third_party_id_uniq                     | 168 MB
 public.connectr_passage_favorited_count                                | 166 MB
 public.connectr_twitterpassage_quality                                 | 159 MB
 public.connectr_twitterpassage_author_fk_id                            | 118 MB
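
As a rough sanity check on the RAM question, the sizes in the listing can be tallied and set against the 2 GB of RAM mentioned earlier. This is only arithmetic on the 20 relations shown (the database may contain more), and it does not by itself show that memory is the bottleneck; it only shows that these relations cannot all be held in RAM at once.

```python
# Sizes in MB, copied from the 20 rows of the listing above, top to bottom.
relation_sizes_mb = [
    1680, 1199, 1010, 1009, 1002, 620, 449, 256, 230, 229,
    194, 187, 184, 181, 180, 170, 168, 166, 159, 118,
]

RAM_MB = 2 * 1024  # the dedicated machine's 2 GB of RAM

total_mb = sum(relation_sizes_mb)
ratio = total_mb / RAM_MB

print(f"listed relations: {total_mb} MB")
print(f"RAM:              {RAM_MB} MB")
print(f"listed data is about {ratio:.1f}x the size of RAM")
```

A first run for a given word that has to read uncached pages from disk, followed by a faster repeat run, would be consistent with this picture.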

Edit: as Jakub suggested, here is the EXPLAIN ANALYZE output for the query:

 Limit  (cost=37918.71..37918.72 rows=20 width=204) (actual time=1495.133..1495.201 rows=20 loops=1)
   ->  Sort  (cost=37918.71..37919.01 rows=606 width=204) (actual time=1495.129..1495.156 rows=20 loops=1)
         Sort Key: connectr_twitterpassage.quality
         Sort Method: top-N heapsort  Memory: 24kB
         ->  Nested Loop  (cost=18.35..37915.49 rows=606 width=204) (actual time=0.301..1485.234 rows=1249 loops=1)
               ->  Index Scan using connectr_twitterpassage_words_word_id on connectr_twitterpassage_words  (cost=0.00..4905.80 rows=1212 width=4) (actual time=0.091..812.018 rows=1249 loops=1)
                     Index Cond: (word_id = 18890456)
               ->  Index Scan using connectr_passage_pkey on connectr_twitterpassage  (cost=18.35..27.23 rows=1 width=204) (actual time=0.515..0.525 rows=1 loops=1249)
                     Index Cond: (id = connectr_twitterpassage_words.twitterpassage_id)
                     Filter: ((NOT (hashed SubPlan 1)) OR (id IS NULL))
                     SubPlan 1
                       ->  Index Scan using connectr_passageviewevent_user_id on connectr_passageviewevent u1  (cost=0.00..18.34 rows=6 width=4) (actual time=0.033..0.091 rows=5 loops=1)
                             Index Cond: (user_id = 1)
                             Filter: (passage_id IS NOT NULL)
 Total runtime: 1495.700 ms
(15 rows)

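After running the query above for several different words, some are very fast (~200 ms) while others are much slower (~1500 ms or more). If I run the same query more than once, the second run is faster (I'm guessing it gets cached?).

Here are the table definitions: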
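
One way to read the plan above is to work out where the time goes. In EXPLAIN ANALYZE output, the "actual time" of a node that runs in a loop is a per-loop average, so it has to be multiplied by `loops` to get that node's total. The sketch below does that with the figures from the plan; the variable names are just illustrative.

```python
# Figures copied from the first EXPLAIN ANALYZE output above (milliseconds).
total_runtime_ms = 1495.700

# Outer side of the nested loop: index scan on connectr_twitterpassage_words.
outer_scan_ms = 812.018          # loops=1, so this is already the total

# Inner side: index scan on connectr_twitterpassage via connectr_passage_pkey.
inner_scan_ms_per_loop = 0.525   # per-loop average reported by the plan
inner_loops = 1249

inner_scan_total_ms = inner_scan_ms_per_loop * inner_loops
outer_share = outer_scan_ms / total_runtime_ms
inner_share = inner_scan_total_ms / total_runtime_ms

print(f"outer scan: {outer_scan_ms:.0f} ms ({outer_share:.0%} of runtime)")
print(f"inner scan: {inner_scan_total_ms:.0f} ms ({inner_share:.0%} of runtime)")
```

On these numbers, fetching 1249 join-table rows for a single word_id accounts for more than half of the runtime, and the 1249 primary-key lookups account for most of the rest. The sort and the viewed-passages subplan are negligible by comparison.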

                                       Table "public.connectr_word"
       Column        |           Type           |                         Modifiers                          
---------------------+--------------------------+------------------------------------------------------------
 id                  | integer                  | not null default nextval('connectr_word_id_seq'::regclass)
 word                | character varying(10000) | not null
 created             | timestamp with time zone | not null
 modified            | timestamp with time zone | not null
 frequency           | double precision         | 
 is_username         | boolean                  | not null
 is_hashtag          | boolean                  | not null
 cloud_eligible      | boolean                  | not null
 passage_count       | integer                  | 
 avg_quality         | double precision         | 
 last_twitter_search | timestamp with time zone | 
 cloud_approved      | boolean                  | not null
 display_word        | character varying(10000) | not null
 is_trend            | boolean                  | not null
Indexes:
    "connectr_word_pkey" PRIMARY KEY, btree (id)
    "connectr_word_word_uniq" UNIQUE CONSTRAINT, btree (word)
    "connectr_word_avg_quality" btree (avg_quality)
    "connectr_word_cloud_eligible" btree (cloud_eligible)
    "connectr_word_last_twitter_search" btree (last_twitter_search)
    "connectr_word_passage_count" btree (passage_count)
    "connectr_word_word" btree (word)
Referenced by:
    TABLE "connectr_passageviewevent" CONSTRAINT "source_word_id_refs_id_178d46eb" FOREIGN KEY (source_word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_wordmatchrewardevent" CONSTRAINT "tapped_word_id_refs_id_c2ffb369" FOREIGN KEY (tapped_word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_connection" CONSTRAINT "word_id_refs_id_00cccde2" FOREIGN KEY (word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_twitterpassage_words" CONSTRAINT "word_id_refs_id_64f49629" FOREIGN KEY (word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED


                                         Table "public.connectr_twitterpassage"
         Column         |           Type           |                              Modifiers                               
------------------------+--------------------------+----------------------------------------------------------------------
 id                     | integer                  | not null default nextval('connectr_twitterpassage_id_seq'::regclass)
 third_party_id         | character varying(10000) | not null
 source                 | character varying(10000) | not null
 text                   | character varying(10000) | not null
 author                 | character varying(10000) | not null
 raw_data               | character varying(10000) | not null
 created                | timestamp with time zone | not null
 modified               | timestamp with time zone | not null
 third_party_created    | timestamp with time zone | 
 retweet_count          | integer                  | not null
 favorited_count        | integer                  | not null
 lang                   | character varying(10000) | not null
 location               | character varying(10000) | not null
 author_followers_count | integer                  | not null
 is_retweet             | boolean                  | not null
 url                    | character varying(10000) | not null
 author_fk_id           | integer                  | 
 quality                | bigint                   | 
 is_top_tweet           | boolean                  | not null
Indexes:
    "connectr_passage_pkey" PRIMARY KEY, btree (id)
    "connectr_twitterpassage_third_party_id_uniq" UNIQUE CONSTRAINT, btree (third_party_id)
    "connectr_passage_author_followers_count" btree (author_followers_count)
    "connectr_passage_favorited_count" btree (favorited_count)
    "connectr_passage_retweet_count" btree (retweet_count)
    "connectr_passage_source" btree (source)
    "connectr_passage_source_like" btree (source varchar_pattern_ops)
    "connectr_passage_third_party_id" btree (third_party_id)
    "connectr_passage_third_party_id_like" btree (third_party_id varchar_pattern_ops)
    "connectr_twitterpassage_author_fk_id" btree (author_fk_id)
    "connectr_twitterpassage_created" btree (created)
    "connectr_twitterpassage_is_top_tweet" btree (is_top_tweet)
    "connectr_twitterpassage_quality" btree (quality)
    "connectr_twitterpassage_third_party_created" btree (third_party_created)
Foreign-key constraints:
    "author_fk_id_refs_id_074720a5" FOREIGN KEY (author_fk_id) REFERENCES connectr_twitteruser(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "connectr_passageviewevent" CONSTRAINT "passage_id_refs_id_892b36a6" FOREIGN KEY (passage_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_connection" CONSTRAINT "twitter_from_id_refs_id_8adbab24" FOREIGN KEY (twitter_from_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_connection" CONSTRAINT "twitter_to_id_refs_id_8adbab24" FOREIGN KEY (twitter_to_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_twitterpassage_words" CONSTRAINT "twitterpassage_id_refs_id_720f772f" FOREIGN KEY (twitterpassage_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED


                                           Table "public.connectr_user"
           Column           |           Type           |                         Modifiers                          
----------------------------+--------------------------+------------------------------------------------------------
 id                         | integer                  | not null default nextval('connectr_user_id_seq'::regclass)
 user_id                    | character varying(10000) | not null
 reference_name             | character varying(10000) | not null
 created                    | timestamp with time zone | 
 modified                   | timestamp with time zone | 
 score                      | integer                  | not null
 twitter_screen_name        | character varying(10000) | not null
 twitter_oauth_token        | character varying(10000) | not null
 twitter_oauth_token_secret | character varying(10000) | not null
 twitter_keys_last_used     | timestamp with time zone | not null
Indexes:
    "connectr_user_pkey" PRIMARY KEY, btree (id)
    "connectr_user_score" btree (score)
    "connectr_user_user_id" btree (user_id)
    "connectr_user_user_id_like" btree (user_id varchar_pattern_ops)
Referenced by:
    TABLE "connectr_connection" CONSTRAINT "user_id_refs_id_366cf6e8" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_passageviewevent" CONSTRAINT "user_id_refs_id_478f94a2" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_user_reddit_passages" CONSTRAINT "user_id_refs_id_488fdfea" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_wordmatchrewardevent" CONSTRAINT "user_id_refs_id_8a36f38a" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
    TABLE "connectr_user_book_passages" CONSTRAINT "user_id_refs_id_e830956b" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED

                                      Table "public.connectr_passageviewevent"
     Column     |           Type           |                               Modifiers                                
----------------+--------------------------+------------------------------------------------------------------------
 id             | integer                  | not null default nextval('connectr_passageviewevent_id_seq'::regclass)
 passage_id     | integer                  | not null
 user_id        | integer                  | not null
 source_word_id | integer                  | not null
 next_id        | integer                  | 
 connection_id  | integer                  | 
 date           | timestamp with time zone | not null
Indexes:
    "connectr_passageviewevent_pkey" PRIMARY KEY, btree (id)
    "connectr_passageviewevent_connection_id" btree (connection_id)
    "connectr_passageviewevent_date" btree (date)
    "connectr_passageviewevent_next_id" btree (next_id)
    "connectr_passageviewevent_passage_id" btree (passage_id)
    "connectr_passageviewevent_source_word_id" btree (source_word_id)
    "connectr_passageviewevent_user_id" btree (user_id)
Foreign-key constraints:
    "connection_id_refs_id_a3ff7fc2" FOREIGN KEY (connection_id) REFERENCES connectr_connection(id) DEFERRABLE INITIALLY DEFERRED
    "next_id_refs_id_f737727c" FOREIGN KEY (next_id) REFERENCES connectr_passageviewevent(id) DEFERRABLE INITIALLY DEFERRED
    "passage_id_refs_id_892b36a6" FOREIGN KEY (passage_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
    "source_word_id_refs_id_178d46eb" FOREIGN KEY (source_word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
    "user_id_refs_id_478f94a2" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
    TABLE "connectr_passageviewevent" CONSTRAINT "next_id_refs_id_f737727c" FOREIGN KEY (next_id) REFERENCES connectr_passageviewevent(id) DEFERRABLE INITIALLY DEFERRED

Here is the raw SQL of the query that is (sometimes) slow, as generated by Django:

SELECT "connectr_twitterpassage"."id", "connectr_twitterpassage"."third_party_id", "connectr_twitterpassage"."third_party_created", "connectr_twitterpassage"."source", "connectr_twitterpassage"."text", "connectr_twitterpassage"."author", "connectr_twitterpassage"."raw_data", "connectr_twitterpassage"."retweet_count", "connectr_twitterpassage"."favorited_count", "connectr_twitterpassage"."lang", "connectr_twitterpassage"."location", "connectr_twitterpassage"."author_followers_count", "connectr_twitterpassage"."is_retweet", "connectr_twitterpassage"."url", "connectr_twitterpassage"."author_fk_id", "connectr_twitterpassage"."quality", "connectr_twitterpassage"."is_top_tweet", "connectr_twitterpassage"."created", "connectr_twitterpassage"."modified" 
    FROM "connectr_twitterpassage" INNER JOIN "connectr_twitterpassage_words" 
    ON ("connectr_twitterpassage"."id" = "connectr_twitterpassage_words"."twitterpassage_id") 
    WHERE ("connectr_twitterpassage_words"."word_id" = 19514309  
    AND NOT (("connectr_twitterpassage"."id" 
    IN (SELECT U1."passage_id" FROM "connectr_passageviewevent" U1 WHERE (U1."user_id" = 1  AND U1."passage_id" IS NOT NULL)) AND "connectr_twitterpassage"."id" IS NOT NULL))) 
    ORDER BY "connectr_twitterpassage"."quality" DESC LIMIT 20

After adding these indexes:

create index word_to_twitterpassage_id on connectr_twitterpassage_words (word_id,twitterpassage_id);
create index id_to_quality_sorted on connectr_twitterpassage (id,quality desc nulls last);

the EXPLAIN ANALYZE output now looks like this:

 Limit  (cost=34679.26..34679.31 rows=20 width=206) (actual time=7.883..7.887 rows=20 loops=1)
   ->  Sort  (cost=34679.26..34681.02 rows=704 width=206) (actual time=7.882..7.884 rows=20 loops=1)
         Sort Key: connectr_twitterpassage.quality
         Sort Method: top-N heapsort  Memory: 32kB
         ->  Nested Loop  (cost=16.86..34660.53 rows=704 width=206) (actual time=2.669..7.618 rows=102 loops=1)
               ->  Index Only Scan using word_to_twitterpassage_id on connectr_twitterpassage_words  (cost=0.00..67.21 rows=1408 width=4) (actual time=2.493..3.094 rows=102 loops=1)
                     Index Cond: (word_id = 18860699)
                     Heap Fetches: 1
               ->  Index Scan using connectr_passage_pkey on connectr_twitterpassage  (cost=16.86..24.56 rows=1 width=206) (actual time=0.042..0.043 rows=1 loops=102)
                     Index Cond: (id = connectr_twitterpassage_words.twitterpassage_id)
                     Filter: ((NOT (hashed SubPlan 1)) OR (id IS NULL))
                     SubPlan 1
                       ->  Bitmap Heap Scan on connectr_passageviewevent u1  (cost=4.46..16.80 rows=27 width=4) (actual time=0.049..0.066 rows=25 loops=1)
                             Recheck Cond: (user_id = 1)
                             Filter: (passage_id IS NOT NULL)
                             ->  Bitmap Index Scan on connectr_passageviewevent_user_id  (cost=0.00..4.45 rows=27 width=0) (actual time=0.037..0.037 rows=26 loops=1)
                                   Index Cond: (user_id = 1)
 Total runtime: 8.042 ms
(18 rows)
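
The main change in this plan is that the join table is now read with an Index Only Scan on the new (word_id, twitterpassage_id) index, so the matching twitterpassage_id values come straight from the index. The two plans use different word_id values and return different row counts (1249 vs. 102), so the timings are not directly comparable. The same idea can be shown in a self-contained way with Python's built-in sqlite3 module. Note that this is SQLite, not PostgreSQL: the planner and plan text differ, and the table and index names below are made up for the demonstration. It only illustrates why a composite index can answer this lookup without visiting the table.

```python
import sqlite3


def plan_for(index_columns):
    """Return SQLite's query-plan text for a word_id lookup, given one index."""
    conn = sqlite3.connect(":memory:")
    # Hypothetical stand-in for the many-to-many join table.
    conn.execute(
        "CREATE TABLE twitterpassage_words ("
        " id INTEGER PRIMARY KEY,"
        " twitterpassage_id INTEGER NOT NULL,"
        " word_id INTEGER NOT NULL)"
    )
    conn.execute(
        f"CREATE INDEX demo_idx ON twitterpassage_words ({index_columns})"
    )
    rows = conn.execute(
        "EXPLAIN QUERY PLAN "
        "SELECT twitterpassage_id FROM twitterpassage_words WHERE word_id = ?",
        (1,),
    ).fetchall()
    conn.close()
    # The human-readable plan description is the last column of each row.
    return " ".join(row[-1] for row in rows)


single_plan = plan_for("word_id")
composite_plan = plan_for("word_id, twitterpassage_id")

print("index on (word_id):                    ", single_plan)
print("index on (word_id, twitterpassage_id): ", composite_plan)
```

With the single-column index, SQLite has to follow each index entry back to the table row to read twitterpassage_id. With the composite index, it reports a covering index, which is SQLite's counterpart to the Index Only Scan shown above.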

1 Answer


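As noted, your problem was caused by missing foreign key indexes, which led to nested loop joins driven by other conditions. Adding those indexes solved your problem.

In general, in PostgreSQL you should always index every foreign key on any table of non-trivial size (primary keys and unique fields are indexed automatically), and then add whatever other indexes you actually need. In this case, you were missing foreign key indexes.

Answered 2013-11-28T15:03:55.207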