sql - Performance of a SQL statement with ORDER BY, LIMIT and COUNT

Question

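I have been searching for ways to improve this dangerous combination of functions in a single SQL statement...

To give you some context, I have one table holding several pieces of information about articles (article_id, author, ...) and another table pairing each article_id with a tag_id. Since an article can have several tags, the second table may contain 2 rows with the same article_id and different tag_id values.

To get a list of the 8 articles that have the most tags in common with the one I want (1354 in this case), I wrote the following query: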

SELECT articles.article_id, articles.author, count(articles_tags.article_id) as times
FROM articles
INNER JOIN articles_tags ON (articles.article_id=articles_tags.article_id)
WHERE id_tag IN
    (SELECT article_id FROM articles_tags WHERE article_id=1354)
AND article_id <> 1354
GROUP BY article_id
ORDER BY times DESC
LIMIT 8

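It is very slow... about 90 seconds for 500,000 articles.

If I remove the "order by times" clause, it runs almost instantly, but then I no longer get the most similar articles.

What can I do?

Thanks!!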

score 1 · Accepted Answer

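Queries against sub-selects tend to kill time... Also, since the query looks inaccurate or incomplete, I am assuming your articles_tags table has two columns... one for the actual article ID, and another for the tag ID associated with it.

That said, I would pre-query just the tag IDs for article 1354 (the one you are interested in). Join THAT back to articles_tags on the same tag ID. From there, you take the second alias of articles_tags, grab ITS article ID, and then count how many rows MATCH (via an inner join, not a left join). Apply the group by on article ID as you had, and for grins, join to the articles table to get the author.

Now, a note. Some SQL engines require you to group by all non-aggregate fields, so you may have to add author to the group by (it is always the same per article ID anyway), or change it to MAX(A.author) as Author, which gives the same result.

I would have an index on (tag_id, article_id) (id_tag in your query) so that matching rows are found directly from the tags you hope to have in common. You could have one article with 10 tags and another article with 10 completely different tags, giving 0 in common. Because this is an inner join, that other article will not even appear in the result set.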
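The index suggested above could be created roughly as follows. This is only a sketch: the column types and the index name idx_articles_tags_tag_article are assumptions, and the column names are taken from the question's query (id_tag, article_id).

```sql
-- Assumed scratch definition of the tag table; adapt the types to your real schema.
CREATE TABLE articles_tags (
    article_id INTEGER NOT NULL,
    id_tag     INTEGER NOT NULL
);

-- Hypothetical index name. Leading with id_tag lets the engine look up all
-- articles that carry a given tag; including article_id means the join and
-- the GROUP BY can be answered from the index alone.
CREATE INDEX idx_articles_tags_tag_article
    ON articles_tags (id_tag, article_id);
```

If the table does not already have a key starting with article_id, a second index on (article_id, id_tag) may also help the initial lookup of article 1354's tags.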

You will still pay for going through the half-million articles you described, which could mean millions of actual tag entries.

select
      AT2.article_id,
      A.author,
      count(*) as Times
   from
      -- tags of the article of interest
      ( select ATG.id_tag
           from articles_tags ATG
           where ATG.article_id = 1354 ) CommonTags
         -- other articles carrying any of those tags
         JOIN articles_tags AT2
            on CommonTags.id_tag = AT2.id_tag
            AND AT2.article_id <> 1354
         -- only needed to fetch the author
         JOIN articles A
            on AT2.article_id = A.article_id
   group by
      AT2.article_id   -- add A.author here if your engine requires it
   order by
      Times DESC
   limit 8
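
To see the approach on a small scale, here is a sketch against scratch tables. The column types, author names and sample rows are made up for illustration; only the table and column names come from the question. It uses the MAX(A.author) variant mentioned above so that it also runs on engines that insist on grouping every non-aggregate column.

```sql
-- Scratch tables mirroring the columns used in the question (assumed types).
CREATE TABLE articles (
    article_id INTEGER PRIMARY KEY,
    author     VARCHAR(100)
);
CREATE TABLE articles_tags (
    article_id INTEGER,
    id_tag     INTEGER
);

-- Made-up sample data: article 1 shares 3 tags with 1354, article 2 shares 1,
-- article 3 shares none.
INSERT INTO articles (article_id, author) VALUES
    (1354, 'alice'), (1, 'bob'), (2, 'carol'), (3, 'dave');

INSERT INTO articles_tags (article_id, id_tag) VALUES
    (1354, 10), (1354, 20), (1354, 30),
    (1, 10), (1, 20), (1, 30),
    (2, 10), (2, 99),
    (3, 77);

SELECT AT2.article_id,
       MAX(A.author) AS author,   -- same value for every row of one article
       COUNT(*)      AS times     -- number of tags shared with article 1354
FROM (SELECT id_tag FROM articles_tags WHERE article_id = 1354) CommonTags
JOIN articles_tags AT2
  ON AT2.id_tag = CommonTags.id_tag
 AND AT2.article_id <> 1354
JOIN articles A
  ON A.article_id = AT2.article_id
GROUP BY AT2.article_id
ORDER BY times DESC
LIMIT 8;
-- Returns (1, 'bob', 3) then (2, 'carol', 1); article 3 never appears.
```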

score 0 · Accepted Answer

It seems that it should be possible to do this without any subqueries, and then a quicker query may result.

Here the article of interest is joined to its tags, and then further to other articles having these tags. Then the number of tags for each article is counted and ordered:

SELECT a2.article_id, a2.author, COUNT(t2.id_tag) AS times
FROM articles a1
INNER JOIN articles_tags t1
ON t1.article_id = a1.article_id   -- find tags for starting article
INNER JOIN articles_tags t2
ON t2.id_tag = t1.id_tag           -- find other instances of those tags
AND t2.article_id <> t1.article_id
INNER JOIN articles a2
ON a2.article_id = t2.article_id   -- and the articles where they are used
WHERE a1.article_id = 1354
GROUP BY a2.article_id, a2.author  -- count common tags by articles
ORDER BY times DESC
LIMIT 8

If you know a lower bound on the number of tags in common (e.g. 3), inserting HAVING times > 2 before ORDER BY times DESC could give a further speed improvement.
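
A sketch of that HAVING variant, run against made-up scratch data (the types, author names and rows are assumptions; the id_tag column name follows the question's query). It repeats the aggregate in HAVING instead of using the times alias, since not every engine accepts an alias there.

```sql
-- Scratch tables with assumed types and illustrative rows.
CREATE TABLE articles (
    article_id INTEGER PRIMARY KEY,
    author     VARCHAR(100)
);
CREATE TABLE articles_tags (
    article_id INTEGER,
    id_tag     INTEGER
);

INSERT INTO articles (article_id, author) VALUES
    (1354, 'alice'), (1, 'bob'), (2, 'carol');

INSERT INTO articles_tags (article_id, id_tag) VALUES
    (1354, 10), (1354, 20), (1354, 30),
    (1, 10), (1, 20), (1, 30),
    (2, 10);

SELECT a2.article_id, a2.author, COUNT(t2.id_tag) AS times
FROM articles a1
INNER JOIN articles_tags t1
        ON t1.article_id = a1.article_id
INNER JOIN articles_tags t2
        ON t2.id_tag = t1.id_tag
       AND t2.article_id <> t1.article_id
INNER JOIN articles a2
        ON a2.article_id = t2.article_id
WHERE a1.article_id = 1354
GROUP BY a2.article_id, a2.author
HAVING COUNT(t2.id_tag) > 2        -- keep only articles with at least 3 shared tags
ORDER BY times DESC
LIMIT 8;
-- Returns only (1, 'bob', 3); article 2 shares a single tag and is filtered out.
```

Whether the filter actually saves time depends on the engine: the counts still have to be computed, but fewer groups reach the sort.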
