在尝试从多个连接表中提取大量列(~15-20)时,我将两个视图放在一起,它们可以提取必要的信息。但是,在我的本地数据库中(只有约 1kposts
行),加入这些视图效果很好;当我在我们的生产数据库(约 30k 行)上创建相同的视图posts
并尝试加入该视图时,我意识到该解决方案不会扩展到测试数据集之外。
我试图将这 2 个视图(类别数据——比如——categories.title
和创作者的数据——比如.users.display_name
post_data
我已经将一个示例DBFiddle与一些测试数据放在一起来解释表结构。实际数据有更多列,但这代表了构建查询所需的连接。
table : posts
+-----+-----------+------------+------------------------------------------+----------------------------------------+
| id | parent_id | created_by | message | attachments |
+-----+-----------+------------+------------------------------------------+----------------------------------------+
| 8 | NULL | 8 | laptop for sale | [{"media_id": 1380}] |
| 9 | NULL | 4 | NEW lamp shade up for grabs | [{"media_id": 1442}, {"link_id": 103}] |
| 10 | 1 | 7 | Oooh I could be interested | |
| 11 | 1 | 7 | DMing you now! I've been looking for one | |
+-----+-----------+------------+------------------------------------------+----------------------------------------+
table : users
+----+------------------+---------------------------+
| id | display_name | created_at |
+----+------------------+---------------------------+
| 1 | John Appleseed | 2018-02-20T00:00:00+00:00 |
| 2 | Massimo Jenkins | 2018-05-14T00:00:00+00:00 |
| 3 | Johanna Marionna | 2018-06-05T00:00:00+00:00 |
| 4 | Jackson Creek | 2018-11-15T00:00:00+00:00 |
| 5 | Joe Schmoe | 2019-01-09T00:00:00+00:00 |
| 6 | John Johnson | 2019-02-14T00:00:00+00:00 |
| 7 | Donna Madison | 2019-05-14T00:00:00+00:00 |
| 8 | Jenna Kaplan | 2019-06-23T00:00:00+00:00 |
+----+------------------+---------------------------+
table : categories
+----+------------+------------+-------------------------------------------------------+
| id | created_by | title | description |
+----+------------+------------+-------------------------------------------------------+
| 1 | 2 | Technology | Anything tech; Consumer, business or education tools! |
| 2 | 2 | Home Goods | Anything for the home |
+----+------------+------------+-------------------------------------------------------+
table : categories_posts
+---------+-------------+
| post_id | category_id |
+---------+-------------+
| 8 | 1 |
| 9 | 1 |
| 10 | 1 |
| 11 | 1 |
+---------+-------------+
table : users_categories
+---------+-------------+
| user_id | category_id |
+---------+-------------+
| 1 | 1 |
| 2 | 1 |
| 3 | 1 |
| 4 | 1 |
+---------+-------------+
table : posts_removed
+---------+----------------------+------------+
| post_id | removed_at | removed_by |
+---------+----------------------+------------+
| 10 | 2019-01-22 09:08:14 | 7 |
+---------+----------------------+------------+
在下面的查询中,符合条件的职位是在 base 中确定的SELECT
;然后,将 post_data CTE 加入结果集(限制为 25 行)并返回 CTE 中的所有列。
WITH post_data AS (
SELECT posts.id,
posts.parent_id,
posts.created_by,
posts.attachments,
categories_posts.category_id,
categories.title,
categories.created_by AS category_created_by,
creator.display_name AS creator_display_name,
creator.created_at AS creator_created_at
/* ... And a whole bunch of other fields from posts, categories_posts, users */
FROM posts
LEFT OUTER JOIN categories_posts
ON categories_posts.post_id = posts.id
LEFT OUTER JOIN categories
ON categories.id = categories_posts.category_id
LEFT OUTER JOIN users creator
ON creator.id = posts.created_by
/* ... And a whole bunch of other joins to facilitate the selected fields */
)
SELECT post_data.*
FROM posts
/* Set up the criteria for the posts selected before getting their data from the CTE */
LEFT OUTER JOIN posts_removed removed ON removed.post_id = posts.id
LEFT OUTER JOIN users user_me ON user_me.id = "1"
LEFT OUTER JOIN users_followed ON users_followed.user_id = posts.created_by
AND users_followed.followed_by = user_me.id
LEFT OUTER JOIN categories_posts ON categories_posts.post_id = posts.id
LEFT OUTER JOIN users_categories ON users_categories.category_id = categories_posts.category_id
LEFT OUTER JOIN posts_removed pp_removed ON pp_removed.post_id = posts.parent_id
/* Join our post_data on the post's ID */
JOIN post_data ON post_data.id = posts.id
WHERE
(
(
users_categories.user_id = user_me.id AND users_categories.left_at IS NULL
) OR categories_posts.category_id IS NULL
) AND (
posts.created_by = user_me.id
OR users_followed.followed_by = user_me.id
OR categories_posts.category_id IS NOT NULL
) AND removed.removed_at IS NULL
AND pp_removed.removed_at IS NULL
AND (post_data.id = posts.id OR post_data.id = posts.parent_id)
ORDER BY posts.id DESC
LIMIT 25
理论上,我认为这可以通过根据基本选择标准选择行,然后根据 Post ID 对 CTE 进行索引扫描来实现;但是,查询优化器似乎选择对表进行全表扫描posts
。
给EXPLAIN SELECT
了我这个信息:
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
| id | select_type | table | type | possible_keys | key | key_len | ref | rows | filtered | extra |
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
| 1 | PRIMARY | posts | ALL | PRIMARY,parent_id,created_by | | | | 33870 | 100 | Using temporary; Using filesort |
| 1 | PRIMARY | removed | eq_ref | PRIMARY | PRIMARY | 8 | posts.id | 1 | 19 | Using where |
| 1 | PRIMARY | user_me | const | PRIMARY | PRIMARY | 8 | const | 1 | 100 | Using where; Using index |
| 1 | PRIMARY | categories_posts | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.id | 1 | 100 | |
| 1 | PRIMARY | categories | eq_ref | PRIMARY | PRIMARY | 8 | categories_posts.category_id | 1 | 100 | Using index |
| 1 | PRIMARY | users_categories | eq_ref | user_id_2,user_id,category_id | user_id_2 | 16 | user_me.id,api.categories_posts.category_id | 1 | 100 | Using where |
| 1 | PRIMARY | users_followed | eq_ref | user_id,followed_by | user_id | 16 | posts.created_by,api.user_me.id | 1 | 100 | Using where; Using index |
| 1 | PRIMARY | pp_removed | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.parent_id | 1 | 19 | Using where |
| 1 | PRIMARY | <derived2> | ALL | | | | | 493911 | 19 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | posts | ALL | | | | | 33870 | 100 | Using temporary |
| 2 | DERIVED | categories_posts | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.id | 1 | 100 | |
| 2 | DERIVED | categories | eq_ref | PRIMARY | PRIMARY | 8 | api.categories_posts.category_id | 1 | 100 | |
| 2 | DERIVED | posts_votes | ref | post_id | post_id | 8 | api.posts.id | 1 | 100 | Using index |
| 2 | DERIVED | pp | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.parent_id | 1 | 100 | |
| 2 | DERIVED | pp_removed | eq_ref | PRIMARY | PRIMARY | 8 | api.pp.id | 1 | 100 | Using index |
| 2 | DERIVED | removed | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.id | 1 | 100 | Using index |
| 2 | DERIVED | creator | eq_ref | PRIMARY | PRIMARY | 8 | api.posts.created_by | 1 | 100 | |
| 2 | DERIVED | usernames | ref | user_id | user_id | 8 | api.creator.id | 1 | 100 | |
| 2 | DERIVED | verifications | ALL | | | | | 4 | 100 | Using where; Using join buffer (Block Nested Loop) |
| 2 | DERIVED | categories_identifiers | ref | category_id | category_id | 8 | api.categories.id | 1 | 100 | |
+----+-------------+------------------------+--------+-------------------------------+-------------+---------+---------------------------------------------+--------+----------+----------------------------------------------------+
除此之外,我尝试重构查询以尝试强制在posts
表中使用键,例如FORCE INDEX(PRIMARY)
在选择中使用,并将 CTE 移动为基本查询并添加过滤器WHERE id IN ({the original base query})
,但优化器似乎仍在进行全表扫描。
如果有助于解码查询计划中发生的事情:
- 在撰写本文时,有33,387
posts
行,但查询计划显示 - 查询计划显示返回33,870行的全表扫描
- 查询计划还将派生表 (
<derived2>
) 显示为具有493,911行
我的核心问题是:
当我说子查询应该只对来自基本选择查询的每个结果行执行一次时,我是否正确?如果是这样,那么 CTE 也应该使用 JOIN
posts.id
并可能使用表索引?为什么查询计划显示它只有33,387行时选择了 33,870 行?493,911 行是从哪里来的?
在这种情况下如何防止全表扫描?