mysql - Optimizing a query on a table with hundreds of millions of rows

Question

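This feels like a "do my homework for me" kind of question, but I'm really stuck trying to make this query run quickly against a table with a very large number of rows. Here's a SQLFiddle that shows the schema (more or less).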

I've played with the indexes, trying to get one that covers all the required columns, but haven't had much success. Here's the create:

CREATE TABLE `AuditEvent` (
    `auditEventId` bigint(20) NOT NULL AUTO_INCREMENT,
    `eventTime` datetime NOT NULL,
    `target1Id` int(11) DEFAULT NULL,
    `target1Name` varchar(100) DEFAULT NULL,
    `target2Id` int(11) DEFAULT NULL,
    `target2Name` varchar(100) DEFAULT NULL,
    `clientId` int(11) NOT NULL DEFAULT '1',
    `type` int(11) not null,
    PRIMARY KEY (`auditEventId`),
    KEY `Transactions` (`clientId`,`eventTime`,`target1Id`,`type`),
    KEY `TransactionsJoin` (`auditEventId`, `clientId`,`eventTime`,`target1Id`,`type`)
)

And (one version of) the select:

select ae.target1Id, ae.type, count(*)
from AuditEvent ae
where ae.clientId=4
    and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
group by ae.target1Id, ae.type;

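I end up with "Using temporary" and "Using filesort" as well. I tried dropping the count(*) and using select distinct instead, which does not cause "Using filesort". That would probably be fine if there were a way to join back and get the counts.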

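Originally, the decision was made to record target1Name and target2Name as they existed when the audit record was created. I need those names as well (the most recent ones will do).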

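Currently the query (above, without the target1Name and target2Name columns) runs in about 5 seconds over roughly 24 million records. Our target is in the hundreds of millions, and we'd like the query to keep performing along those lines (we hope to keep it under 1-2 minutes, though we'd like it to be much better). My fear is that once we reach that volume of data it won't (work to simulate the additional rows is underway).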

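I'm not sure of the best strategy for getting the additional fields. If I add the columns straight into the select query, I lose "Using index". I tried a join back to the table, which keeps "Using index" but takes around 20 seconds.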

I did try changing the eventTime column to an int rather than a datetime, but that didn't seem to affect index use or the run time.

score 5 · Accepted Answer

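As you probably understand, the problem here is the range condition ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00', which (as range conditions always do) prevents efficient use of the Transactions index. The index is effectively used only for the clientId equality and the range on eventTime, the first two key parts; it is not used for the grouping.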

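Most often, the solution is to replace the range condition with an equality check (in your case, introduce a period column, bucket eventTime into periods, and replace the BETWEEN clause with period IN (1,2,3,4,5)). But this might add overhead to your table.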
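The period idea above could look roughly like the following. This is only a sketch under stated assumptions: the period column, its YYYYMM encoding, and the index name TransactionsByPeriod are all hypothetical and not part of the original schema. Note that whole-month periods do not reproduce the exact 03:00:00 and 23:57:00 boundaries of the original BETWEEN, and new rows would need period filled in by the application or a trigger.

```sql
-- Assumption: one period per calendar month, stored as YYYYMM (e.g. 201109).
ALTER TABLE AuditEvent
    ADD COLUMN period INT NOT NULL DEFAULT 0;

-- Backfill existing rows; on a very large table this would be done in batches.
UPDATE AuditEvent
SET period = EXTRACT(YEAR_MONTH FROM eventTime);

-- Hypothetical index: period takes the place of eventTime.
ALTER TABLE AuditEvent
    ADD KEY TransactionsByPeriod (clientId, period, target1Id, type);

-- The range on eventTime becomes a list of equality checks on period.
SELECT ae.target1Id, ae.type, COUNT(*) AS cnt
FROM AuditEvent ae
WHERE ae.clientId = 4
  AND ae.period IN (201109, 201110, 201111, 201112, 201201, 201202, 201203,
                    201204, 201205, 201206, 201207, 201208, 201209)
GROUP BY ae.target1Id, ae.type;
```

Whether this actually helps would need to be checked with EXPLAIN on realistic data volumes.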

Another solution you might try is to add another index, (clientId, target1Id, type, eventTime) (it could replace Transactions if that index is no longer used), and use the following query:

SELECT
  ae.target1Id,
  ae.type,
  COUNT(
    NULLIF(ae.eventTime BETWEEN '2011-09-01 03:00:00' 
                            AND '2012-09-30 23:57:00', 0)
  ) as cnt
FROM AuditEvent ae
WHERE ae.clientId=4
GROUP BY ae.target1Id, ae.type;

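That way you a) move the range condition to the end of the index, b) allow the index to be used for the grouping, and c) make the index a covering index for the query (that is, the query can be answered from the index alone, without reading table rows). The trade-off is that every index entry for the client is scanned, including entries outside the date range.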
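For reference, the index described above could be created along these lines. The index name TransactionsGrouped is a hypothetical choice; only the column order comes from the answer.

```sql
-- Hypothetical index name; column order follows the answer:
-- equality column first, then the GROUP BY columns, then the range column.
ALTER TABLE AuditEvent
    ADD KEY TransactionsGrouped (clientId, target1Id, type, eventTime);

-- Optional, and only once nothing else relies on the old index:
-- ALTER TABLE AuditEvent DROP KEY Transactions;
```

Building an index on a table of this size can take a long time, so it is probably worth trying on a copy of the data first.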

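UPD1: Sorry, yesterday I did not read your post carefully and missed that your problem is retrieving target1Name and target2Name. First of all, I'm not sure you understand Using index correctly. The absence of Using index does not mean the query uses no index; Using index means the index itself contains enough data to execute the (sub)query (that is, the index is covering). Since target1Name and target2Name are not included in any index, the subquery that fetches them will not show Using index.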
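One way to see the distinction is to compare the Extra column of EXPLAIN for two variants of the query. This is a sketch for illustration only: the second query uses MAX(ae.target1Name) merely to force a non-indexed column into the select list, not as a recommended way to get the most recent name.

```sql
-- Every referenced column is part of the Transactions index, so the
-- Extra column is expected to include "Using index" (as the question reports).
EXPLAIN
SELECT ae.target1Id, ae.type, COUNT(*)
FROM AuditEvent ae
WHERE ae.clientId = 4
  AND ae.eventTime BETWEEN '2011-09-01 03:00:00' AND '2012-09-30 23:57:00'
GROUP BY ae.target1Id, ae.type;

-- target1Name is not in any index, so table rows must be read as well.
-- An index may still be chosen (see the key column), but "Using index"
-- is not expected to appear in Extra.
EXPLAIN
SELECT ae.target1Id, ae.type, MAX(ae.target1Name), COUNT(*)
FROM AuditEvent ae
WHERE ae.clientId = 4
  AND ae.eventTime BETWEEN '2011-09-01 03:00:00' AND '2012-09-30 23:57:00'
GROUP BY ae.target1Id, ae.type;
```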

If your question is just how to add these two fields to your query (which you consider fast enough), then simply try the following:

SELECT a1.target1Id, a1.type, cnt, target1Name, target2Name
FROM (
  select ae.target1Id, ae.type, count(*) as cnt, MAX(auditEventId) as max_id
  from AuditEvent ae
  where ae.clientId=4
      and (ae.eventTime between '2011-09-01 03:00:00' and '2012-09-30 23:57:00')
  group by ae.target1Id, ae.type) as a1
JOIN AuditEvent a2 ON a1.max_id = a2.auditEventId
;
