
I have two tables:

packages and package_to_tag, both using the MyISAM engine.

The tables are structured as follows (packages first):

+----------------+------------------+----------------+
|   aid(primary) |     source       |   date(index)  |
+----------------+------------------+----------------+
|   1            |    CA            |   2013-04-05   |
+----------------+------------------+----------------+
|   2            |    FL            |   2013-05-05   |
+----------------+------------------+----------------+
|   3            |    UT            |   2012-06-13   |
+----------------+------------------+----------------+
|   4            |    VT            |   2011-04-29   |
+----------------+------------------+----------------+
|   5            |    CT            |   2013-04-10   |
+----------------+------------------+----------------+

package_to_tag has a unique index on (package, tag), and package_aid and tag are each indexed as well:

+---------------+------------------+
|  package_aid  |     tag          |
+---------------+------------------+
|   2           |    sports        |
+---------------+------------------+
|   2           |    nba           |
+---------------+------------------+
|   1           |    food          |
+---------------+------------------+
|   1           |    burrito       |
+---------------+------------------+
|   4           |    hockey        |
+---------------+------------------+
|   4           |    sports        |
+---------------+------------------+
|   3           |    news          |
+---------------+------------------+
|   5           |    sports        |
+---------------+------------------+
|   5           |    nba           |
+---------------+------------------+

So my basic query to find which packages have both sports and nba as tags is:

SELECT package_aid FROM package_to_tag
WHERE tag IN("sports","nba")
GROUP BY package_aid
HAVING COUNT(*) = 2

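This works great until I try to add date ordering to the results. (Keep in mind that my packages table is in the 400k-row range.)

My query to get the source for the packages with matching tags is: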

SELECT package_aid, source 
FROM package_to_tag
RIGHT JOIN packages ON packages.aid = package_to_tag.package_aid
AND tag IN("sports","nba")
GROUP BY package_aid
HAVING COUNT(*) = 2
ORDER BY date DESC
LIMIT 500

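With 400k records, this takes up to 5 seconds, unless I remove the date ordering, in which case it takes less than a second. Since I've always had good success with IN clauses, I tried to narrow my initial result set like this: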

SELECT aid,source FROM packages
WHERE aid IN(
  SELECT package_aid FROM package_to_tag
  WHERE tag IN("sports","nba")
  GROUP BY package_aid
  HAVING COUNT(*) = 2
)
ORDER BY date DESC
LIMIT 500

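I figured the sort would then only be applied to about 8-10k records instead of the entire record set.

However, this just pegs the database at 100% utilization and I have to restart it, even when I add extra tags to the inner select so that it returns 80 records or fewer in total.

I tried running just the inner query on its own: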

SELECT package_aid FROM package_to_tag
WHERE tag IN("sports","nba")
GROUP BY package_aid
HAVING COUNT(*) = 2

This returns 8-10k records in under a second.

What am I missing?


1 Answer


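Earlier versions of MySQL have problems optimizing IN subqueries: the subquery can end up being re-evaluated for each row of the outer table. A simple solution is to rewrite it as an EXISTS clause: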

SELECT aid,source FROM packages
WHERE exists (
  SELECT package_aid
  FROM package_to_tag
  WHERE tag IN("sports","nba") and package_aid = packages.aid
  GROUP BY package_aid
  HAVING COUNT(*) = 2
)
ORDER BY date DESC
LIMIT 500
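
As a sanity check that the EXISTS rewrite selects the same packages as the IN version, here is a minimal sketch that runs both against the sample rows from the question. It uses Python's built-in sqlite3 module with an in-memory database as a stand-in for MySQL (an assumption made purely so the example is self-contained), so it says nothing about MySQL's timings, only about the result set. The index name idx_packages_date is made up for the illustration.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE packages (aid INTEGER PRIMARY KEY, source TEXT, date TEXT);
CREATE INDEX idx_packages_date ON packages (date);  -- hypothetical index name
CREATE TABLE package_to_tag (
    package_aid INTEGER,
    tag TEXT,
    UNIQUE (package_aid, tag)
);
INSERT INTO packages VALUES
    (1, 'CA', '2013-04-05'), (2, 'FL', '2013-05-05'), (3, 'UT', '2012-06-13'),
    (4, 'VT', '2011-04-29'), (5, 'CT', '2013-04-10');
INSERT INTO package_to_tag VALUES
    (2, 'sports'), (2, 'nba'), (1, 'food'), (1, 'burrito'), (4, 'hockey'),
    (4, 'sports'), (3, 'news'), (5, 'sports'), (5, 'nba');
""")

# Original form: uncorrelated IN subquery.
in_query = """
SELECT aid, source FROM packages
WHERE aid IN (
    SELECT package_aid FROM package_to_tag
    WHERE tag IN ('sports', 'nba')
    GROUP BY package_aid
    HAVING COUNT(*) = 2
)
ORDER BY date DESC
LIMIT 500
"""

# Rewritten form: correlated EXISTS subquery.
exists_query = """
SELECT aid, source FROM packages
WHERE EXISTS (
    SELECT package_aid FROM package_to_tag
    WHERE tag IN ('sports', 'nba') AND package_aid = packages.aid
    GROUP BY package_aid
    HAVING COUNT(*) = 2
)
ORDER BY date DESC
LIMIT 500
"""

in_rows = conn.execute(in_query).fetchall()
exists_rows = conn.execute(exists_query).fetchall()
print(in_rows)      # packages 2 and 5 carry both tags; newest date first
print(exists_rows)  # same rows from the EXISTS form
```

Both forms return packages 2 and 5 on the sample data; the difference on MySQL is in how the optimizer executes them, not in what they return.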

An index on package_to_tag(package_aid, tag) should help performance a lot.
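
To illustrate why a composite index on (package_aid, tag) fits the EXISTS form, the sketch below isolates the per-package probe that the correlated subquery performs and runs it for each sample package. Again this uses Python's sqlite3 with an in-memory database rather than MySQL, and the index name idx_package_tag is a hypothetical name chosen for the example. The query plan is printed for inspection only; its wording is specific to SQLite and will not match MySQL's EXPLAIN output.

```python
import sqlite3

# In-memory SQLite stands in for MySQL here (assumption for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE package_to_tag (package_aid INTEGER, tag TEXT);
-- Composite index with package_aid first, then tag (hypothetical name).
CREATE INDEX idx_package_tag ON package_to_tag (package_aid, tag);
INSERT INTO package_to_tag VALUES
    (2, 'sports'), (2, 'nba'), (1, 'food'), (1, 'burrito'), (4, 'hockey'),
    (4, 'sports'), (3, 'news'), (5, 'sports'), (5, 'nba');
""")

# The probe the correlated subquery effectively runs once per outer row:
# both filtered columns are covered by the composite index.
probe = """
SELECT COUNT(*) FROM package_to_tag
WHERE package_aid = ? AND tag IN ('sports', 'nba')
"""

# Number of matching tags per sample package (2 means both tags are present).
matches = {aid: conn.execute(probe, (aid,)).fetchone()[0] for aid in range(1, 6)}
print(matches)

# SQLite's description of how it would execute the probe (last column is the detail text).
plan = [row[-1] for row in conn.execute("EXPLAIN QUERY PLAN " + probe, (5,))]
print(plan)
```

On MySQL the equivalent statement would be a plain CREATE INDEX on package_to_tag (package_aid, tag). The question states that a unique index on the package-tag pair already exists, so it is worth confirming with EXPLAIN whether that index has package_aid as its leading column before adding another one.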

answered 2013-06-16T20:34:21.793