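I have a Django application backed by a PostgreSQL 9.1.9 database, which runs on its own dedicated machine with 2 GB of RAM. The database stores a cache of Twitter tweets (about 1 million of them) and indexes them by the words they contain. Here are the relevant models: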
from django.db import models

# STANDARD_MAX_LEN is a project-level constant defined elsewhere in the app.

class TwitterPassage(models.Model):
    third_party_id = models.CharField(max_length=STANDARD_MAX_LEN, db_index=True, unique=True)
    third_party_created = models.DateTimeField(null=True, db_index=True)
    source = models.CharField(max_length=STANDARD_MAX_LEN)
    text = models.CharField(max_length=STANDARD_MAX_LEN)
    author = models.CharField(max_length=STANDARD_MAX_LEN)
    # Every word that appears in the tweet; stored in connectr_twitterpassage_words.
    words = models.ManyToManyField('connectr.Word')
    quality = models.BigIntegerField(null=True, blank=True, db_index=True)
    author_fk = models.ForeignKey('connectr.TwitterUser', null=True)

class Word(models.Model):
    word = models.CharField(max_length=STANDARD_MAX_LEN, db_index=True, unique=True)
    display_word = models.CharField(max_length=STANDARD_MAX_LEN, default='', blank=True)
    passage_count = models.IntegerField(null=True, db_index=True, blank=True)

class User(models.Model):
    user_id = models.CharField(max_length=STANDARD_MAX_LEN, db_index=True)
    # Tweets the user has already seen, recorded as PassageViewEvent rows.
    tweet_passages = models.ManyToManyField('connectr.TwitterPassage', through='connectr.PassageViewEvent')
A Word has a many-to-many relationship with every TwitterPassage that contains that word.
The query I am running is:
# Exclude tweets this user has already seen, and find the 20 highest quality tweets they haven't yet seen
word.twitterpassage_set.exclude(user=current_user).order_by('-quality')[:20]
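Quality is an integer score ranging from roughly 0 to 300.
Sometimes this query is as fast as I need it to be (under a second). At other times it is extremely slow, taking as long as 10 seconds. The slowness seems to affect very common words such as "their" or "my" in particular, and is less pronounced for rare words that are associated with fewer TwitterPassages.
I have 8 indexed fields on the TwitterPassage model and 5 on the Word model. Does this simply mean that I need more RAM, or fewer indexes? How would I determine which of these would fix the problem?
Also, in case it helps, here is some information on the size of the database: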
relation | size
------------------------------------------------------------------------+---------
public.connectr_twitterpassage_words_word_id | 1680 MB
public.connectr_twitterpassage_twitterpassage_id_613c80271f09fba8_uniq | 1199 MB
public.connectr_twitterpassage_words_pkey | 1010 MB
public.connectr_twitterpassage_words | 1009 MB
public.connectr_twitterpassage_words_twitterpassage_id | 1002 MB
public.connectr_twitterpassage | 620 MB
public.connectr_twitteruser | 449 MB
public.connectr_twitterpassage_created | 256 MB
public.connectr_passage_source_like | 230 MB
public.connectr_passage_source | 229 MB
public.connectr_twitterpassage_is_top_tweet | 194 MB
public.connectr_passage_pkey | 187 MB
public.connectr_word | 184 MB
public.connectr_passage_third_party_id_like | 181 MB
public.connectr_passage_third_party_id | 180 MB
public.connectr_passage_retweet_count | 170 MB
public.connectr_twitterpassage_third_party_id_uniq | 168 MB
public.connectr_passage_favorited_count | 166 MB
public.connectr_twitterpassage_quality | 159 MB
public.connectr_twitterpassage_author_fk_id | 118 MB
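As a rough sanity check on the RAM question, the sizes in the listing above can be added up and compared with the 2 GB on the database machine. This is only a sketch: it assumes the listing covers the largest relations, and the "join table" subtotal simply counts names that start with `public.connectr_twitterpassage_words`, which is my own grouping rather than anything PostgreSQL reports.

```python
# Relation sizes in MB, copied from the listing above.
RELATION_SIZES_MB = [
    ("public.connectr_twitterpassage_words_word_id", 1680),
    ("public.connectr_twitterpassage_twitterpassage_id_613c80271f09fba8_uniq", 1199),
    ("public.connectr_twitterpassage_words_pkey", 1010),
    ("public.connectr_twitterpassage_words", 1009),
    ("public.connectr_twitterpassage_words_twitterpassage_id", 1002),
    ("public.connectr_twitterpassage", 620),
    ("public.connectr_twitteruser", 449),
    ("public.connectr_twitterpassage_created", 256),
    ("public.connectr_passage_source_like", 230),
    ("public.connectr_passage_source", 229),
    ("public.connectr_twitterpassage_is_top_tweet", 194),
    ("public.connectr_passage_pkey", 187),
    ("public.connectr_word", 184),
    ("public.connectr_passage_third_party_id_like", 181),
    ("public.connectr_passage_third_party_id", 180),
    ("public.connectr_passage_retweet_count", 170),
    ("public.connectr_twitterpassage_third_party_id_uniq", 168),
    ("public.connectr_passage_favorited_count", 166),
    ("public.connectr_twitterpassage_quality", 159),
    ("public.connectr_twitterpassage_author_fk_id", 118),
]

RAM_MB = 2 * 1024  # the database machine has 2 GB of RAM

# Total size of everything in the listing.
total_mb = sum(size for _name, size in RELATION_SIZES_MB)

# Subtotal for the many-to-many join table and the indexes named after it.
JOIN_PREFIX = "public.connectr_twitterpassage_words"
join_mb = sum(size for name, size in RELATION_SIZES_MB if name.startswith(JOIN_PREFIX))

ram_ratio = total_mb / RAM_MB

print(f"all listed relations: {total_mb} MB ({ram_ratio:.1f}x RAM)")
print(f"join table and its indexes: {join_mb} MB")
```

If these figures are representative, the listed relations cannot all be cached at once, and the join table with its indexes is larger than the machine's RAM on its own.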
Edit: Following Jakub's suggestion, here is the EXPLAIN ANALYZE output for the query:
Limit (cost=37918.71..37918.72 rows=20 width=204) (actual time=1495.133..1495.201 rows=20 loops=1)
  -> Sort (cost=37918.71..37919.01 rows=606 width=204) (actual time=1495.129..1495.156 rows=20 loops=1)
        Sort Key: connectr_twitterpassage.quality
        Sort Method: top-N heapsort Memory: 24kB
        -> Nested Loop (cost=18.35..37915.49 rows=606 width=204) (actual time=0.301..1485.234 rows=1249 loops=1)
              -> Index Scan using connectr_twitterpassage_words_word_id on connectr_twitterpassage_words (cost=0.00..4905.80 rows=1212 width=4) (actual time=0.091..812.018 rows=1249 loops=1)
                    Index Cond: (word_id = 18890456)
              -> Index Scan using connectr_passage_pkey on connectr_twitterpassage (cost=18.35..27.23 rows=1 width=204) (actual time=0.515..0.525 rows=1 loops=1249)
                    Index Cond: (id = connectr_twitterpassage_words.twitterpassage_id)
                    Filter: ((NOT (hashed SubPlan 1)) OR (id IS NULL))
                    SubPlan 1
                      -> Index Scan using connectr_passageviewevent_user_id on connectr_passageviewevent u1 (cost=0.00..18.34 rows=6 width=4) (actual time=0.033..0.091 rows=5 loops=1)
                            Index Cond: (user_id = 1)
                            Filter: (passage_id IS NOT NULL)
Total runtime: 1495.700 ms
(15 rows)
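To see where the time goes in this plan, the per-node timings can be multiplied out. In EXPLAIN ANALYZE output a node's actual time is an average per loop, so its total is the end time multiplied by `loops`. The sketch below does that arithmetic for the two index scans under the nested loop; the labels in `PLAN_NODES` are my own shorthand and the cost estimates have been trimmed from the copied text.

```python
import re

# Timing fragments copied from the plan above (cost estimates trimmed).
PLAN_NODES = {
    "nested loop": "(actual time=0.301..1485.234 rows=1249 loops=1)",
    "join-table index scan": "(actual time=0.091..812.018 rows=1249 loops=1)",
    "passage pkey index scan": "(actual time=0.515..0.525 rows=1 loops=1249)",
}

ACTUAL = re.compile(
    r"actual time=(\d+\.\d+)\.\.(\d+\.\d+) rows=(\d+) loops=(\d+)"
)

def total_ms(node_text):
    """Return a node's total time: per-loop end time multiplied by loop count."""
    match = ACTUAL.search(node_text)
    if match is None:
        raise ValueError(f"no timing found in {node_text!r}")
    _start, end, _rows, loops = match.groups()
    return float(end) * int(loops)

join_scan_ms = total_ms(PLAN_NODES["join-table index scan"])
pkey_scan_ms = total_ms(PLAN_NODES["passage pkey index scan"])
nested_loop_ms = total_ms(PLAN_NODES["nested loop"])

# Share of the nested loop's time spent in its two child scans.
scan_share = (join_scan_ms + pkey_scan_ms) / nested_loop_ms

print(f"join-table scan: {join_scan_ms:.1f} ms")
print(f"pkey lookups:    {pkey_scan_ms:.1f} ms over 1249 loops")
print(f"share of nested loop: {scan_share:.1%}")
```

By this rough accounting the two index scans make up nearly all of the nested loop's time, and the scan of the join table alone takes more than half of it.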
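After running the above query for several different words, some are quite fast (~200 ms) while others are much slower (~1500 ms or more). If I run the same query several times, the second run is faster (I assume it is being cached?).
Here are the table definitions: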
Table "public.connectr_word"
Column | Type | Modifiers
---------------------+--------------------------+------------------------------------------------------------
id | integer | not null default nextval('connectr_word_id_seq'::regclass)
word | character varying(10000) | not null
created | timestamp with time zone | not null
modified | timestamp with time zone | not null
frequency | double precision |
is_username | boolean | not null
is_hashtag | boolean | not null
cloud_eligible | boolean | not null
passage_count | integer |
avg_quality | double precision |
last_twitter_search | timestamp with time zone |
cloud_approved | boolean | not null
display_word | character varying(10000) | not null
is_trend | boolean | not null
Indexes:
"connectr_word_pkey" PRIMARY KEY, btree (id)
"connectr_word_word_uniq" UNIQUE CONSTRAINT, btree (word)
"connectr_word_avg_quality" btree (avg_quality)
"connectr_word_cloud_eligible" btree (cloud_eligible)
"connectr_word_last_twitter_search" btree (last_twitter_search)
"connectr_word_passage_count" btree (passage_count)
"connectr_word_word" btree (word)
Referenced by:
TABLE "connectr_passageviewevent" CONSTRAINT "source_word_id_refs_id_178d46eb" FOREIGN KEY (source_word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_wordmatchrewardevent" CONSTRAINT "tapped_word_id_refs_id_c2ffb369" FOREIGN KEY (tapped_word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_connection" CONSTRAINT "word_id_refs_id_00cccde2" FOREIGN KEY (word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_twitterpassage_words" CONSTRAINT "word_id_refs_id_64f49629" FOREIGN KEY (word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
Table "public.connectr_twitterpassage"
Column | Type | Modifiers
------------------------+--------------------------+----------------------------------------------------------------------
id | integer | not null default nextval('connectr_twitterpassage_id_seq'::regclass)
third_party_id | character varying(10000) | not null
source | character varying(10000) | not null
text | character varying(10000) | not null
author | character varying(10000) | not null
raw_data | character varying(10000) | not null
created | timestamp with time zone | not null
modified | timestamp with time zone | not null
third_party_created | timestamp with time zone |
retweet_count | integer | not null
favorited_count | integer | not null
lang | character varying(10000) | not null
location | character varying(10000) | not null
author_followers_count | integer | not null
is_retweet | boolean | not null
url | character varying(10000) | not null
author_fk_id | integer |
quality | bigint |
is_top_tweet | boolean | not null
Indexes:
"connectr_passage_pkey" PRIMARY KEY, btree (id)
"connectr_twitterpassage_third_party_id_uniq" UNIQUE CONSTRAINT, btree (third_party_id)
"connectr_passage_author_followers_count" btree (author_followers_count)
"connectr_passage_favorited_count" btree (favorited_count)
"connectr_passage_retweet_count" btree (retweet_count)
"connectr_passage_source" btree (source)
"connectr_passage_source_like" btree (source varchar_pattern_ops)
"connectr_passage_third_party_id" btree (third_party_id)
"connectr_passage_third_party_id_like" btree (third_party_id varchar_pattern_ops)
"connectr_twitterpassage_author_fk_id" btree (author_fk_id)
"connectr_twitterpassage_created" btree (created)
"connectr_twitterpassage_is_top_tweet" btree (is_top_tweet)
"connectr_twitterpassage_quality" btree (quality)
"connectr_twitterpassage_third_party_created" btree (third_party_created)
Foreign-key constraints:
"author_fk_id_refs_id_074720a5" FOREIGN KEY (author_fk_id) REFERENCES connectr_twitteruser(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
TABLE "connectr_passageviewevent" CONSTRAINT "passage_id_refs_id_892b36a6" FOREIGN KEY (passage_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_connection" CONSTRAINT "twitter_from_id_refs_id_8adbab24" FOREIGN KEY (twitter_from_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_connection" CONSTRAINT "twitter_to_id_refs_id_8adbab24" FOREIGN KEY (twitter_to_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_twitterpassage_words" CONSTRAINT "twitterpassage_id_refs_id_720f772f" FOREIGN KEY (twitterpassage_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
Table "public.connectr_user"
Column | Type | Modifiers
----------------------------+--------------------------+------------------------------------------------------------
id | integer | not null default nextval('connectr_user_id_seq'::regclass)
user_id | character varying(10000) | not null
reference_name | character varying(10000) | not null
created | timestamp with time zone |
modified | timestamp with time zone |
score | integer | not null
twitter_screen_name | character varying(10000) | not null
twitter_oauth_token | character varying(10000) | not null
twitter_oauth_token_secret | character varying(10000) | not null
twitter_keys_last_used | timestamp with time zone | not null
Indexes:
"connectr_user_pkey" PRIMARY KEY, btree (id)
"connectr_user_score" btree (score)
"connectr_user_user_id" btree (user_id)
"connectr_user_user_id_like" btree (user_id varchar_pattern_ops)
Referenced by:
TABLE "connectr_connection" CONSTRAINT "user_id_refs_id_366cf6e8" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_passageviewevent" CONSTRAINT "user_id_refs_id_478f94a2" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_user_reddit_passages" CONSTRAINT "user_id_refs_id_488fdfea" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_wordmatchrewardevent" CONSTRAINT "user_id_refs_id_8a36f38a" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
TABLE "connectr_user_book_passages" CONSTRAINT "user_id_refs_id_e830956b" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
Table "public.connectr_passageviewevent"
Column | Type | Modifiers
----------------+--------------------------+------------------------------------------------------------------------
id | integer | not null default nextval('connectr_passageviewevent_id_seq'::regclass)
passage_id | integer | not null
user_id | integer | not null
source_word_id | integer | not null
next_id | integer |
connection_id | integer |
date | timestamp with time zone | not null
Indexes:
"connectr_passageviewevent_pkey" PRIMARY KEY, btree (id)
"connectr_passageviewevent_connection_id" btree (connection_id)
"connectr_passageviewevent_date" btree (date)
"connectr_passageviewevent_next_id" btree (next_id)
"connectr_passageviewevent_passage_id" btree (passage_id)
"connectr_passageviewevent_source_word_id" btree (source_word_id)
"connectr_passageviewevent_user_id" btree (user_id)
Foreign-key constraints:
"connection_id_refs_id_a3ff7fc2" FOREIGN KEY (connection_id) REFERENCES connectr_connection(id) DEFERRABLE INITIALLY DEFERRED
"next_id_refs_id_f737727c" FOREIGN KEY (next_id) REFERENCES connectr_passageviewevent(id) DEFERRABLE INITIALLY DEFERRED
"passage_id_refs_id_892b36a6" FOREIGN KEY (passage_id) REFERENCES connectr_twitterpassage(id) DEFERRABLE INITIALLY DEFERRED
"source_word_id_refs_id_178d46eb" FOREIGN KEY (source_word_id) REFERENCES connectr_word(id) DEFERRABLE INITIALLY DEFERRED
"user_id_refs_id_478f94a2" FOREIGN KEY (user_id) REFERENCES connectr_user(id) DEFERRABLE INITIALLY DEFERRED
Referenced by:
TABLE "connectr_passageviewevent" CONSTRAINT "next_id_refs_id_f737727c" FOREIGN KEY (next_id) REFERENCES connectr_passageviewevent(id) DEFERRABLE INITIALLY DEFERRED
Here is the raw SQL (generated by Django) for the query that is sometimes slow:
SELECT "connectr_twitterpassage"."id", "connectr_twitterpassage"."third_party_id", "connectr_twitterpassage"."third_party_created", "connectr_twitterpassage"."source", "connectr_twitterpassage"."text", "connectr_twitterpassage"."author", "connectr_twitterpassage"."raw_data", "connectr_twitterpassage"."retweet_count", "connectr_twitterpassage"."favorited_count", "connectr_twitterpassage"."lang", "connectr_twitterpassage"."location", "connectr_twitterpassage"."author_followers_count", "connectr_twitterpassage"."is_retweet", "connectr_twitterpassage"."url", "connectr_twitterpassage"."author_fk_id", "connectr_twitterpassage"."quality", "connectr_twitterpassage"."is_top_tweet", "connectr_twitterpassage"."created", "connectr_twitterpassage"."modified"
FROM "connectr_twitterpassage" INNER JOIN "connectr_twitterpassage_words"
ON ("connectr_twitterpassage"."id" = "connectr_twitterpassage_words"."twitterpassage_id")
WHERE ("connectr_twitterpassage_words"."word_id" = 19514309
AND NOT (("connectr_twitterpassage"."id"
IN (SELECT U1."passage_id" FROM "connectr_passageviewevent" U1 WHERE (U1."user_id" = 1 AND U1."passage_id" IS NOT NULL)) AND "connectr_twitterpassage"."id" IS NOT NULL)))
ORDER BY "connectr_twitterpassage"."quality" DESC LIMIT 20
I then added these two indexes:
create index word_to_twitterpassage_id on connectr_twitterpassage_words (word_id,twitterpassage_id);
create index id_to_quality_sorted on connectr_twitterpassage (id,quality desc nulls last);
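The idea behind the first index is that a composite index on (word_id, twitterpassage_id) already contains both columns the join needs, so the table itself does not have to be read. The sketch below illustrates that effect with SQLite from the Python standard library, used purely as a self-contained stand-in: its planner is not PostgreSQL's, the three-column table is a simplified assumption of the real join table, and SQLite calls the behaviour a "covering index".

```python
import sqlite3

# Simplified stand-in for the many-to-many join table (an assumption).
TABLE_SQL = """
CREATE TABLE connectr_twitterpassage_words (
    id INTEGER PRIMARY KEY,
    twitterpassage_id INTEGER NOT NULL,
    word_id INTEGER NOT NULL
)
"""

# The part of the slow query that touches the join table.
QUERY = (
    "SELECT twitterpassage_id FROM connectr_twitterpassage_words "
    "WHERE word_id = ?"
)

def plan_with_index(index_sql):
    """Create a fresh in-memory table with one index and return SQLite's plan."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(TABLE_SQL)
        conn.execute(index_sql)
        rows = conn.execute("EXPLAIN QUERY PLAN " + QUERY, (1,)).fetchall()
        # The human-readable description is the last column of each row.
        return " | ".join(row[-1] for row in rows)
    finally:
        conn.close()

# Single-column index, as Django created it for the foreign key.
single = plan_with_index(
    "CREATE INDEX connectr_twitterpassage_words_word_id "
    "ON connectr_twitterpassage_words (word_id)"
)

# Composite index, matching the first CREATE INDEX statement above.
composite = plan_with_index(
    "CREATE INDEX word_to_twitterpassage_id "
    "ON connectr_twitterpassage_words (word_id, twitterpassage_id)"
)

print("single-column index:", single)
print("composite index:    ", composite)
```

The second index, id_to_quality_sorted, is not exercised by this sketch.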
With those indexes in place, the EXPLAIN ANALYZE output now looks like this:
Limit (cost=34679.26..34679.31 rows=20 width=206) (actual time=7.883..7.887 rows=20 loops=1)
  -> Sort (cost=34679.26..34681.02 rows=704 width=206) (actual time=7.882..7.884 rows=20 loops=1)
        Sort Key: connectr_twitterpassage.quality
        Sort Method: top-N heapsort Memory: 32kB
        -> Nested Loop (cost=16.86..34660.53 rows=704 width=206) (actual time=2.669..7.618 rows=102 loops=1)
              -> Index Only Scan using word_to_twitterpassage_id on connectr_twitterpassage_words (cost=0.00..67.21 rows=1408 width=4) (actual time=2.493..3.094 rows=102 loops=1)
                    Index Cond: (word_id = 18860699)
                    Heap Fetches: 1
              -> Index Scan using connectr_passage_pkey on connectr_twitterpassage (cost=16.86..24.56 rows=1 width=206) (actual time=0.042..0.043 rows=1 loops=102)
                    Index Cond: (id = connectr_twitterpassage_words.twitterpassage_id)
                    Filter: ((NOT (hashed SubPlan 1)) OR (id IS NULL))
                    SubPlan 1
                      -> Bitmap Heap Scan on connectr_passageviewevent u1 (cost=4.46..16.80 rows=27 width=4) (actual time=0.049..0.066 rows=25 loops=1)
                            Recheck Cond: (user_id = 1)
                            Filter: (passage_id IS NOT NULL)
                            -> Bitmap Index Scan on connectr_passageviewevent_user_id (cost=0.00..4.45 rows=27 width=0) (actual time=0.037..0.037 rows=26 loops=1)
                                  Index Cond: (user_id = 1)
Total runtime: 8.042 ms
(18 rows)
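One caveat when comparing the two plans: they were run for different words (word_id 18890456 returned 1249 rows from the join table, word_id 18860699 only 102), so the total runtimes are not directly comparable. A rough way to account for that is to normalize the join-table scan by the number of rows it returned. This is only a sketch of that arithmetic using the figures from the two outputs, and it does not separate the effect of the new index from the effect of caching.

```python
# Figures copied from the two EXPLAIN ANALYZE outputs:
# end time of the scan on connectr_twitterpassage_words, rows it returned,
# and the total runtime of the query.
BEFORE = {"word_id": 18890456, "scan_ms": 812.018, "rows": 1249, "total_ms": 1495.700}
AFTER = {"word_id": 18860699, "scan_ms": 3.094, "rows": 102, "total_ms": 8.042}

def ms_per_row(run):
    """Average time spent per join-table row returned by the scan."""
    return run["scan_ms"] / run["rows"]

# Naive comparison: total runtime before vs. after.
raw_speedup = BEFORE["total_ms"] / AFTER["total_ms"]

# Comparison normalized by the number of rows each word matched.
per_row_speedup = ms_per_row(BEFORE) / ms_per_row(AFTER)

print(f"before: {ms_per_row(BEFORE):.3f} ms per row")
print(f"after:  {ms_per_row(AFTER):.3f} ms per row")
print(f"raw runtime ratio: {raw_speedup:.0f}x, per-row ratio: {per_row_speedup:.0f}x")
```

Even after normalizing, the join-table scan is considerably cheaper per row in the second plan, although by a much smaller factor than the raw runtimes suggest.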