
Suppose I have the following database setup (a simplified version from what I actually have):

Table: news_posting (500,000+ entries)
| --------------------------------------------------------------|
| posting_id  | name      | is_active   | released_date | token |
| 1           | posting_1 | 1           | 2013-01-10    | 123   |
| 2           | posting_2 | 1           | 2013-01-11    | 124   |
| 3           | posting_3 | 0           | 2013-01-12    | 125   |
| --------------------------------------------------------------|
PRIMARY posting_id
INDEX sorting ON (is_active, released_date, token)

Table: news_category (500 entries)
| ------------------------------|
| category_id   | name          |
| 1             | category_1    |
| 2             | category_2    |
| 3             | category_3    |
| ------------------------------|
PRIMARY category_id

Table: news_cat_match (1,000,000+ entries)
| ------------------------------|
| category_id   | posting_id    |
| 1             | 1             |
| 2             | 1             |
| 3             | 1             |
| 2             | 2             |
| 3             | 2             |
| 1             | 3             |
| 2             | 3             |
| ------------------------------|
UNIQUE idx (category_id, posting_id)
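
For reference, the setup above could be written as DDL roughly as follows. This is only a sketch: the column types and NOT NULL constraints are assumptions, and only the table names, column names, keys, and index definitions come from the description above.

```sql
-- Column types below are assumed for illustration; the real schema may differ.
CREATE TABLE news_posting (
    posting_id    INT UNSIGNED NOT NULL,
    name          VARCHAR(255) NOT NULL,
    is_active     TINYINT(1)   NOT NULL,
    released_date DATE         NOT NULL,
    token         INT          NOT NULL,
    PRIMARY KEY (posting_id),
    INDEX sorting (is_active, released_date, token)
);

CREATE TABLE news_category (
    category_id INT UNSIGNED NOT NULL,
    name        VARCHAR(255) NOT NULL,
    PRIMARY KEY (category_id)
);

-- Many-to-many link between postings and categories.
CREATE TABLE news_cat_match (
    category_id INT UNSIGNED NOT NULL,
    posting_id  INT UNSIGNED NOT NULL,
    UNIQUE KEY idx (category_id, posting_id)
);
```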

My task is as follows. I must return a page of 50 news postings (at some offset) that are active, that were released before today's date, and that belong to at least one of the 20 or so categories specified in the request. Before choosing the 50 postings to return, I must sort the matching postings by token in descending order. My query is currently similar to the following:

SELECT DISTINCT np.posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20))
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50

With just one category_id specified, the query does not involve a filesort and is reasonably fast, because it does not have to remove duplicate results. However, running EXPLAIN on the above query with multiple category_ids shows that a filesort is needed, and the query is extremely slow on my data set.
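
To make the check reproducible, this is roughly how the plan can be inspected. It is a sketch with a shortened category list, and it assumes MySQL, whose EXPLAIN output reports the sort in the Extra column.

```sql
-- Prefix the query with EXPLAIN to see the execution plan instead of the rows.
EXPLAIN
SELECT DISTINCT np.posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm
    ON ncm.posting_id = np.posting_id
   AND ncm.category_id IN (1, 2, 3)
WHERE np.is_active = 1
  AND np.released_date < '2013-01-28'
ORDER BY np.token DESC
LIMIT 50;
-- "Using filesort" (often together with "Using temporary") in the Extra column
-- means the rows are sorted in a separate pass rather than read in index order.
```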

Is there any way to optimize the table setup and/or the query?


1 Answer


By rewriting the query as follows, I was able to make it run even faster than the version with a single-value category list:

SELECT posting_id
FROM news_posting np
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
AND EXISTS (
    SELECT ncm.posting_id
    FROM news_cat_match ncm 
    WHERE ncm.posting_id = np.posting_id
    AND ncm.category_id IN (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20)
    LIMIT 1
)
ORDER BY np.token DESC LIMIT 50

It now takes less than a second on my data set.

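Sadly, this is even faster than when only one category_id is specified. That is because the set of matching news items is larger than with a single category_id, so the query finds its results sooner.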
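
A possible further tweak, untested on this data set: the EXISTS subquery probes news_cat_match once per candidate posting, always by posting_id. The existing unique index leads with category_id, so an additional index that leads with posting_id might make each probe cheaper. The index name idx_posting_category is hypothetical, and whether the optimizer actually picks it would need to be confirmed with EXPLAIN.

```sql
-- Hypothetical extra index: posting_id first, so the correlated
-- "ncm.posting_id = np.posting_id" lookup can seek directly to one posting.
ALTER TABLE news_cat_match
    ADD INDEX idx_posting_category (posting_id, category_id);
```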

My next question is whether this can be optimized for the case where a category has only a few news postings, spread out over time.

The following is still slow on my development machine. It is fast enough on the production server, but I would like to optimize it if possible.

SELECT DISTINCT np.posting_id
FROM news_posting np
INNER JOIN news_cat_match ncm ON (ncm.posting_id = np.posting_id AND ncm.category_id = 1)
WHERE np.is_active = 1
AND np.released_date < '2013-01-28'
ORDER BY np.token DESC LIMIT 50
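
One direction that might help in the sparse case, offered as an untested sketch: read the small set of matches for the category first and only then look up the postings. Because of the unique index on (category_id, posting_id), a single category can match each posting at most once, so DISTINCT is not needed here. STRAIGHT_JOIN is MySQL-specific and forces the tables to be read in the order written; this assumes MySQL and that the category really has few rows.

```sql
-- Drive the join from news_cat_match (few rows for a sparse category),
-- then fetch each posting by primary key and sort the small result.
SELECT np.posting_id
FROM news_cat_match ncm
STRAIGHT_JOIN news_posting np
    ON np.posting_id = ncm.posting_id
WHERE ncm.category_id = 1
  AND np.is_active = 1
  AND np.released_date < '2013-01-28'
ORDER BY np.token DESC
LIMIT 50;
```

This still sorts, but only over the postings of that one category, which should be cheap when the category is small. For a large category it could be slower than the original query.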

Does anyone have any further suggestions?

Answered 2013-01-29T19:48:59.570