6

I have a simple question about the most efficient way to perform a particular join.

Take these three tables (real names have been changed to protect the innocent):

Table: animal

animal_id   name   ...
======================
1           bunny
2           bear
3           cat
4           mouse

Table: tags

tag_id     tag
==================
1          fluffy
2          brown
3          cute
4          small

Mapping Table: animal_tag

animal_id   tag_id
==================
1           1
1           2
1           3
2           2
3           4
4           2

I want to find all animals that are tagged as 'fluffy', 'brown', and 'cute'. That is to say that the animal must be tagged with all three. In reality, the number of required tags can vary, but should be irrelevant for this discussion. This is the query I came up with:

SELECT * FROM animal
JOIN (
      SELECT at.animal_id FROM animal_tag at
      WHERE at.tag_id IN (
                          SELECT tg.tag_id FROM tag tg
                          WHERE tg.tag='fluffy' OR tg.tag='brown' OR tg.tag='cute'
                          )
      GROUP BY at.animal_id HAVING COUNT(at.tag_id)=3
      ) AS jt
ON animal.animal_id=jt.animal_id
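
To try the query outside Derby, here is a minimal sketch that loads the sample rows into an in-memory SQLite database through Python's sqlite3 module and runs the same counting query. SQLite, the helper name `build_db`, the alias `atag`, and the composite primary key on the mapping table are assumptions made for illustration; the tag table is named `tags` as in the listing above. Plans and cost estimates will not match Derby's.

```python
import sqlite3


def build_db():
    """Create the three sample tables in an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.executescript("""
        CREATE TABLE animal (animal_id INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag TEXT);
        -- Assumption: each (animal_id, tag_id) pair appears at most once.
        CREATE TABLE animal_tag (
            animal_id INTEGER,
            tag_id INTEGER,
            PRIMARY KEY (animal_id, tag_id)
        );
        INSERT INTO animal VALUES (1,'bunny'),(2,'bear'),(3,'cat'),(4,'mouse');
        INSERT INTO tags VALUES (1,'fluffy'),(2,'brown'),(3,'cute'),(4,'small');
        INSERT INTO animal_tag VALUES (1,1),(1,2),(1,3),(2,2),(3,4),(4,2);
    """)
    return conn


# Same shape as the query above: filter the mapping rows to the wanted tags,
# then keep only animals that matched all three of them.
QUERY = """
SELECT animal.animal_id, animal.name
FROM animal
JOIN (
    SELECT atag.animal_id
    FROM animal_tag atag
    WHERE atag.tag_id IN (
        SELECT tg.tag_id FROM tags tg
        WHERE tg.tag IN ('fluffy', 'brown', 'cute')
    )
    GROUP BY atag.animal_id
    HAVING COUNT(atag.tag_id) = 3
) AS jt ON animal.animal_id = jt.animal_id
"""

conn = build_db()
rows = conn.execute(QUERY).fetchall()
print(rows)  # [(1, 'bunny')]
```

With the sample data only the bunny carries all three tags, so it is the single row returned.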

On a table with thousands of 'animals' and hundreds of 'tags', this query performs respectably: tens of milliseconds. However, when I look at the query plan (Apache Derby is the DB), the optimizer's estimated cost is pretty high (9945.12) and the plan is pretty extensive. For a query this "simple" I usually try to get query plans with an estimated cost in the single or double digits.

So my question is, is there a better way to perform this query? Seems like a simple query, but I've been stumped coming up with anything better.

4

5 Answers

1

Try this:

SELECT a.*
FROM animal a
INNER JOIN 
  ( 
    SELECT at.animal_id
    FROM tag t
    INNER JOIN animal_tag at ON at.tag_id = t.tag_id
    WHERE tag IN ('fluffy', 'brown', 'cute')
    GROUP BY at.animal_id
    HAVING count(*) = 3
  ) f ON  a.animal_id = f.animal_id

Here is another option, just for fun:

SELECT a.*
FROM animal a
INNER JOIN animal_tag at1 on at1.animal_id = a.animal_id
INNER JOIN tag t1 on t1.tag_id = at1.tag_id
INNER JOIN animal_tag at2 on at2.animal_id = a.animal_id
INNER JOIN tag t2 on t2.tag_id = at2.tag_id
INNER JOIN animal_tag at3 on at3.animal_id = a.animal_id
INNER JOIN tag t3 on t3.tag_id = at3.tag_id
WHERE t1.tag = 'fluffy' AND t2.tag = 'brown' AND t3.tag = 'cute'

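I really don't expect that last option to perform well. The other options avoid going back to the tag table multiple times to resolve tag names from ids. But you never know what the query optimizer will do until you try.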

answered 2012-02-07T03:59:41.717
1

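You can create a temporary table with DECLARE GLOBAL TEMPORARY TABLE and then do an INNER JOIN against it to eliminate the "WHERE IN". A set-based join is usually more efficient than a WHERE clause that has to be evaluated for every row.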
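
A minimal sketch of the temporary-table idea, run against SQLite through Python's sqlite3 module rather than Derby. SQLite's `CREATE TEMP TABLE` stands in here for Derby's `DECLARE GLOBAL TEMPORARY TABLE`, and the table name `wanted_tag` and alias `atag` are hypothetical. One side effect of this layout is that the required count can be read from the temporary table instead of being hard-coded.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE animal (animal_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE animal_tag (
        animal_id INTEGER,
        tag_id INTEGER,
        PRIMARY KEY (animal_id, tag_id)
    );
    INSERT INTO animal VALUES (1,'bunny'),(2,'bear'),(3,'cat'),(4,'mouse');
    INSERT INTO tags VALUES (1,'fluffy'),(2,'brown'),(3,'cute'),(4,'small');
    INSERT INTO animal_tag VALUES (1,1),(1,2),(1,3),(2,2),(3,4),(4,2);

    -- Assumption: SQLite temp table used in place of Derby's
    -- DECLARE GLOBAL TEMPORARY TABLE; "wanted_tag" is a made-up name.
    CREATE TEMP TABLE wanted_tag (tag TEXT PRIMARY KEY);
    INSERT INTO wanted_tag VALUES ('fluffy'), ('brown'), ('cute');
""")

# The IN list becomes a join against wanted_tag, and the literal 3 becomes
# the row count of that same table.
QUERY = """
SELECT a.animal_id, a.name
FROM animal a
JOIN animal_tag atag ON atag.animal_id = a.animal_id
JOIN tags t ON t.tag_id = atag.tag_id
JOIN wanted_tag w ON w.tag = t.tag
GROUP BY a.animal_id, a.name
HAVING COUNT(*) = (SELECT COUNT(*) FROM wanted_tag)
"""

rows = conn.execute(QUERY).fetchall()
print(rows)  # [(1, 'bunny')]
```

Whether this beats the IN version on Derby is something only the actual query plan can tell.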

answered 2012-02-07T02:21:42.397
1

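First of all, many thanks to everyone who chimed in. As several commenters mentioned, the answer in the end is relational division.

Although I did take a course on Codd's relational data model many moons ago, like many such courses it did not really cover relational division. Without realizing it, my original query was actually an application of relational division.

Referring to slides 26-27 of this presentation on relational division, my query applies the technique of comparing set cardinalities. I tried some of the other methods of applying relational division, but at least in my case the counting method gave the fastest run time. I encourage anyone interested in this problem to read the slides mentioned above, as well as the article Mikael Eriksson referenced on this page. Thanks again, everyone.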

answered 2012-02-08T00:16:10.907
1

Try this:

SELECT DISTINCT f.Animal_ID, g.Name
FROM Animal f INNER JOIN 
    (SELECT a.Animal_ID, a.Name, COUNT(*) as iCount
     FROM   Animal a INNER JOIN Animal_Tag b
                  ON a.Animal_ID = b.animal_ID
                     INNER JOIN Tags c
                  On b.tag_ID = c.tag_ID
    WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here
    GROUP BY a.Animal_ID, a.Name) g
        ON f.Animal_ID = g.Animal_ID -- join the derived table back to Animal
WHERE g.iCount = 3 -- No. of tags

Update

    SELECT DISTINCT a.Animal_ID, a.Name, COUNT(*) as iCount
    FROM    Animal a INNER JOIN Animal_Tag b
                  ON a.Animal_ID = b.animal_ID
                     INNER JOIN Tags c
                  On b.tag_ID = c.tag_ID
    WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here
    GROUP BY a.Animal_ID, a.Name
    HAVING  COUNT(*) = 3 -- No. of tags
answered 2012-02-07T02:37:20.203
0

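I wonder how bad it would be to use relational division there. Could you give it a try? I know it will take more time, but I'm interested in how much :) It would be great if you could provide the estimated cost and the time.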

select a2.animal_id, a2.name from animal a2
where not exists (
    select * from animal a1, tags t1
    where not exists (
        select * from animal_tag at1
        where at1.animal_id = a1.animal_id and at1.tag_id = t1.tag_id
    ) and a2.animal_id = a1.animal_id and t1.tag in ('fluffy', 'brown', 'cute')
)
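
As a rough check that the double NOT EXISTS form and the counting form select the same animals, here is a small sketch that runs both on the question's sample rows. It uses SQLite through Python's sqlite3 module instead of Derby, the table and column names from the question's listing (`animal`, `tags`, `animal_tag`, `name`, `tag_id`), and a simplified NOT EXISTS query that drops the redundant second scan of `animal`; all of these are assumptions for illustration, and the sketch says nothing about relative cost on Derby.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE animal (animal_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE animal_tag (
        animal_id INTEGER,
        tag_id INTEGER,
        PRIMARY KEY (animal_id, tag_id)
    );
    INSERT INTO animal VALUES (1,'bunny'),(2,'bear'),(3,'cat'),(4,'mouse');
    INSERT INTO tags VALUES (1,'fluffy'),(2,'brown'),(3,'cute'),(4,'small');
    INSERT INTO animal_tag VALUES (1,1),(1,2),(1,3),(2,2),(3,4),(4,2);
""")

# Division by double negation: keep an animal if there is no wanted tag
# that the animal lacks.
NOT_EXISTS = """
SELECT a2.animal_id, a2.name
FROM animal a2
WHERE NOT EXISTS (
    SELECT 1 FROM tags t1
    WHERE t1.tag IN ('fluffy', 'brown', 'cute')
      AND NOT EXISTS (
          SELECT 1 FROM animal_tag at1
          WHERE at1.animal_id = a2.animal_id
            AND at1.tag_id = t1.tag_id
      )
)
ORDER BY a2.animal_id
"""

# Division by counting: keep an animal if it matched all three wanted tags.
COUNTING = """
SELECT a.animal_id, a.name
FROM animal a
JOIN animal_tag b ON a.animal_id = b.animal_id
JOIN tags c ON b.tag_id = c.tag_id
WHERE c.tag IN ('fluffy', 'brown', 'cute')
GROUP BY a.animal_id, a.name
HAVING COUNT(*) = 3
ORDER BY a.animal_id
"""

by_not_exists = conn.execute(NOT_EXISTS).fetchall()
by_counting = conn.execute(COUNTING).fetchall()
print(by_not_exists, by_counting)
```

The two forms can disagree when a requested tag is missing from the tags table: the NOT EXISTS form only checks tags that exist, while the counting form still compares against the literal 3.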

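Now, looking for a fast query, I can't come up with anything faster than John's or yours. Actually, John's may be a bit slower than yours because it performs unnecessary operations (remove the DISTINCT from the select and remove the count(*)):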

SELECT a.Animal_ID, a.Name FROM Animal a
INNER JOIN Animal_Tag b ON a.Animal_ID = b.animal_ID
INNER JOIN Tags c On b.tag_ID = c.tag_ID
WHERE c.tag IN ('fluffy', 'brown', 'cute') -- list all tags here
GROUP BY a.Animal_ID, a.Name
HAVING count(*) = 3 -- No. of tags

This should be as fast as yours.

PS: Is there any way to get rid of that damned 3 without duplicating the WHERE clause? My brain is boiling :)
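
One possible way around the hard-coded 3, assuming the query is issued from application code: build the IN list from a list of tags and bind the list length as the HAVING count. The sketch below uses Python's sqlite3 module and SQLite rather than Derby, and the function name `animals_with_all_tags` is hypothetical.

```python
import sqlite3


def animals_with_all_tags(conn, tags):
    """Return (animal_id, name) rows for animals carrying every tag in `tags`."""
    wanted = sorted(set(tags))  # de-duplicate so the count matches the IN list
    if not wanted:
        raise ValueError("at least one tag is required")
    placeholders = ", ".join("?" for _ in wanted)
    sql = f"""
        SELECT a.animal_id, a.name
        FROM animal a
        JOIN animal_tag b ON a.animal_id = b.animal_id
        JOIN tags c ON b.tag_id = c.tag_id
        WHERE c.tag IN ({placeholders})
        GROUP BY a.animal_id, a.name
        HAVING COUNT(*) = ?
        ORDER BY a.animal_id
    """
    # The tag names fill the IN list; the last parameter is the required count.
    return conn.execute(sql, (*wanted, len(wanted))).fetchall()


conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE animal (animal_id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE tags (tag_id INTEGER PRIMARY KEY, tag TEXT);
    CREATE TABLE animal_tag (
        animal_id INTEGER,
        tag_id INTEGER,
        PRIMARY KEY (animal_id, tag_id)
    );
    INSERT INTO animal VALUES (1,'bunny'),(2,'bear'),(3,'cat'),(4,'mouse');
    INSERT INTO tags VALUES (1,'fluffy'),(2,'brown'),(3,'cute'),(4,'small');
    INSERT INTO animal_tag VALUES (1,1),(1,2),(1,3),(2,2),(3,4),(4,2);
""")

all_three = animals_with_all_tags(conn, ["fluffy", "brown", "cute"])
only_brown = animals_with_all_tags(conn, ["brown"])
print(all_three)   # [(1, 'bunny')]
print(only_brown)  # [(1, 'bunny'), (2, 'bear'), (4, 'mouse')]
```

The tag names are listed only once, in the Python list, so neither the WHERE clause nor the count has to be edited by hand when the set of tags changes.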

answered 2012-02-07T05:29:41.220