sql - How do I efficiently do the intersection of joins in SQL?

Question

I have three tables: books, tags, and taggings (a books-to-tags cross-reference):

books
id | title |      author     
 1 | Blink | Malcolm Gladwell
 2 |  1984 |    George Orwell

taggings
book_id | tag_id
      1 |      1
      1 |      2
      2 |      1
      2 |      3

tags
id | name
 1 | interesting
 2 |  nonfiction
 3 |     fiction

I want to find every book tagged both "interesting" and "fiction". The best I have come up with is

select books.* from books, taggings, tags
 where taggings.book_id = books.id
   and taggings.tag_id  = tags.id
   and tags.name = 'interesting'
intersect
select books.* from books, taggings, tags
 where taggings.book_id = books.id
   and taggings.tag_id  = tags.id
   and tags.name = 'fiction'

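This seems to work, but I'm not sure how it will scale, either in rows or in number of tags. That is, what happens as I add hundreds of books, hundreds of tags, and thousands of taggings? What happens when the search becomes "'interesting', 'fiction', 'aquatic', and 'stonemasonry'"?

If there is no better way to do the query directly in SQL, I have another approach in mind:

Select all books with the first tag, along with all of those books' tags
Remove from the list any book that does not have all of the queried tags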
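
The two steps above can be sketched as follows. This is only a minimal illustration, assuming Python's built-in sqlite3 module and an in-memory copy of the sample tables; the helper names make_sample_db and books_with_all_tags are hypothetical and not part of the original question.

```python
import sqlite3

def make_sample_db():
    """Build an in-memory copy of the sample tables from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE taggings (book_id INTEGER, tag_id INTEGER);
        INSERT INTO books VALUES (1, 'Blink', 'Malcolm Gladwell'),
                                 (2, '1984', 'George Orwell');
        INSERT INTO tags VALUES (1, 'interesting'), (2, 'nonfiction'), (3, 'fiction');
        INSERT INTO taggings VALUES (1, 1), (1, 2), (2, 1), (2, 3);
    """)
    return conn

def books_with_all_tags(conn, wanted):
    """Step 1: fetch books carrying the first tag, with all of their tags.
    Step 2: drop, in application code, any book missing a wanted tag."""
    wanted = list(wanted)
    if not wanted:
        raise ValueError("need at least one tag name")
    rows = conn.execute(
        """
        SELECT b.id, b.title, t.name
        FROM books b
        JOIN taggings tg ON tg.book_id = b.id
        JOIN tags t ON t.id = tg.tag_id
        WHERE b.id IN (SELECT tg1.book_id
                       FROM taggings tg1
                       JOIN tags t1 ON t1.id = tg1.tag_id
                       WHERE t1.name = ?)
        """,
        (wanted[0],),
    ).fetchall()
    # Collect the set of tag names seen for each candidate book.
    tags_by_book = {}
    for book_id, title, tag_name in rows:
        tags_by_book.setdefault((book_id, title), set()).add(tag_name)
    required = set(wanted)
    # Keep only books whose tag set contains every required tag.
    return sorted(key for key, names in tags_by_book.items() if required <= names)
```

The trade-off is that every tag of every candidate book travels to the client, so this is likely to pay off only when the first tag is selective.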

score 3 · Accepted Answer

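If you want to keep the option of using more than two tags, this answer to a similar question may interest you.

It uses MySQL syntax (I'm not sure what you are using), but it is quite simple, and you should be able to use it with other databases.

For your case it would look like this (using MySQL syntax):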

SELECT books.id, books.title, books.author
FROM books
INNER JOIN taggings ON ( taggings.book_id = books.id )
INNER JOIN tags ON ( tags.id = taggings.tag_id )
WHERE tags.name IN ( @tag1, @tag2, @tag3 )
GROUP BY books.id, books.title, books.author
HAVING COUNT(*) = @number_of_tags

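From my other post:

If you have 3 tags as in your example, then number_of_tags must be 3, and the join will produce 3 rows per matching id.

You can either create that query dynamically, or define it with, say, 10 tags and initialize the unused ones with a value that will not occur in tags.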
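
Creating the query dynamically could look roughly like the sketch below. It assumes Python's built-in sqlite3 module and the sample data from the question; make_sample_db and find_books_tagged_with_all are hypothetical names. One deliberate deviation from the query above: it counts COUNT(DISTINCT tags.name) rather than COUNT(*), an assumption added here so that a duplicated (book_id, tag_id) row in taggings does not inflate the count.

```python
import sqlite3

def make_sample_db():
    """Build an in-memory copy of the sample tables from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE taggings (book_id INTEGER, tag_id INTEGER);
        INSERT INTO books VALUES (1, 'Blink', 'Malcolm Gladwell'),
                                 (2, '1984', 'George Orwell');
        INSERT INTO tags VALUES (1, 'interesting'), (2, 'nonfiction'), (3, 'fiction');
        INSERT INTO taggings VALUES (1, 1), (1, 2), (2, 1), (2, 3);
    """)
    return conn

def find_books_tagged_with_all(conn, tag_names):
    """Return books that carry every tag in tag_names (any number of tags)."""
    names = sorted(set(tag_names))  # de-duplicate so the expected count is right
    if not names:
        raise ValueError("need at least one tag name")
    # One "?" placeholder per tag; values are bound, never interpolated.
    placeholders = ", ".join("?" for _ in names)
    sql = f"""
        SELECT books.id, books.title, books.author
        FROM books
        INNER JOIN taggings ON taggings.book_id = books.id
        INNER JOIN tags ON tags.id = taggings.tag_id
        WHERE tags.name IN ({placeholders})
        GROUP BY books.id, books.title, books.author
        HAVING COUNT(DISTINCT tags.name) = ?
        ORDER BY books.id
    """
    return conn.execute(sql, (*names, len(names))).fetchall()
```

Only the number of placeholders is generated; the tag names themselves are passed as bound parameters, which keeps the dynamic query safe from SQL injection.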

score 1 · Accepted Answer

with
  tt as
  (
      select id
      from tags
      where name in ('interesting', 'fiction')
  ),
  mm as
  (
      select book_id
      from taggings join tt on taggings.tag_id = tt.id
      group by taggings.book_id having count(*) = 2
  )
select books.*
from books join mm on books.id = mm.book_id

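This variant seems to produce a better execution plan than Peter Lang's solution (at least on Oracle), for the following reasons (paraphrased from EXPLAIN PLAN):

The join between tags and taggings is performed table-to-index instead of table-to-table. I don't know whether this would really affect query performance on large data sets.
The plan groups and counts the data set before performing the final join to books. This will definitely affect performance on large data sets.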

score 1 · Accepted Answer

Here is a bit of an "old-school" SQL dialect, but the syntax is more compact, and these are still inner joins.

select * from books, taggings tg1, tags t1, taggings tg2, tags t2 
 where tg1.book_id = books.id
   and tg1.tag_id  = t1.id
   and t1.name = 'interesting'
   and tg2.book_id = books.id
   and tg2.tag_id  = t2.id
   and t2.name = 'fiction'

Edit: Wow, Stackers really dislike joining too much in one query. Subqueries with exists leave more room for optimization:

select * from books
 where exists (select * from taggings, tags
                where tags.name = 'fiction'
                  and taggings.tag_id = tags.id
                  and taggings.book_id = books.id)
   and exists (select * from taggings, tags
                where tags.name = 'interesting'
                  and taggings.tag_id = tags.id
                  and taggings.book_id = books.id)
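
To extend the exists form to an arbitrary number of tags, one correlated subquery can be generated per tag. The following is a minimal sketch, assuming Python's built-in sqlite3 module and the question's sample data; make_sample_db, EXISTS_CLAUSE, and find_books_exists are hypothetical names introduced for illustration.

```python
import sqlite3

def make_sample_db():
    """Build an in-memory copy of the sample tables from the question."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE books (id INTEGER PRIMARY KEY, title TEXT, author TEXT);
        CREATE TABLE tags (id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE taggings (book_id INTEGER, tag_id INTEGER);
        INSERT INTO books VALUES (1, 'Blink', 'Malcolm Gladwell'),
                                 (2, '1984', 'George Orwell');
        INSERT INTO tags VALUES (1, 'interesting'), (2, 'nonfiction'), (3, 'fiction');
        INSERT INTO taggings VALUES (1, 1), (1, 2), (2, 1), (2, 3);
    """)
    return conn

# One correlated subquery; it is repeated once per requested tag.
EXISTS_CLAUSE = """EXISTS (SELECT 1
                FROM taggings
                JOIN tags ON tags.id = taggings.tag_id
                WHERE taggings.book_id = books.id
                  AND tags.name = ?)"""

def find_books_exists(conn, tag_names):
    """Return (id, title) of books for which every requested tag exists."""
    names = list(tag_names)
    if not names:
        raise ValueError("need at least one tag name")
    where = "\n  AND ".join(EXISTS_CLAUSE for _ in names)
    sql = f"SELECT books.id, books.title FROM books WHERE {where} ORDER BY books.id"
    # Each "?" is bound to the tag name at the same position in names.
    return conn.execute(sql, names).fetchall()
```

Because exists only asks whether at least one matching row is present, duplicate rows in taggings do not change the result.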

score 1 · Accepted Answer

I would recommend ALL over intersect, since MySQL actually knows better how to join this, though I lack proper benchmarks.

select books.* from books, taggings, tags
 where taggings.book_id = books.id
   and taggings.tag_id  = tags.id
   and tags.name ALL("interesting", "fiction");

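As for how it scales: with millions of books and low cardinality on the tags table, what you end up doing is moving the tag ids into code/memory, so that you use taggings.tag_id ALL(3, 7, 105) or something like it. The final join that fetches the tags table will not use an index unless you have more than about 1k tags, so you would be doing a table scan every time.

In my experience, joins, intersects, and unions are huge evils for performance. Mostly it is joins that cause the problems we run into. The fewer joins you have, the faster the query ends up.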

score 0 · Accepted Answer

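Which database? That will change the answer a bit. For example, this works on SQL Server and should be faster because it eliminates the need to visit the tags table twice, but it will fail on MySQL (before version 8.0) because MySQL doesn't do CTEs: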

WITH taggingNames
AS
(
    -- one row per tagging, with the tag name attached
    SELECT tags.name, tags.id AS tag_id, taggings.book_id
    FROM tags
    INNER JOIN taggings ON tags.id = taggings.tag_id
)
SELECT b.*
FROM books b
INNER JOIN (
  SELECT t1.book_id
   FROM taggingNames t1
   INNER JOIN taggingNames t2 ON t2.book_id = t1.book_id AND t2.name = 'fiction'
   WHERE t1.name = 'interesting'
   GROUP BY t1.book_id
 ) ids ON b.id = ids.book_id

Thinking about it now, I also like Peter Lang's answer.
