I'm currently working on a home-grown analytics system, currently on MySQL 5.6.10 on Windows Server 2008 (soon to be migrated to Linux, and we're not married to MySQL; we're still exploring different options, including Hadoop).

We just finished a huge import, and queries that were lightning fast for a small customer are now unbearably slow for a large one. I may end up adding a whole new table to pre-compute the results of this query, unless I can figure out how to make the query itself faster.

What the query does is take @StartDate and @EndDate as parameters and, for each day in that range, calculate the date, the number of new reviews on that date, the cumulative review count (including any reviews prior to @StartDate), and the daily average rating (where the average is carried over from the previous day if there is no data for a given date).

The available filters are age, gender, product, company, and rating type. Every review has 1-N ratings, always including at least an "overall" rating, but potentially more per client/product, e.g. "quality", "sound quality", "durability", "value", etc.

The API that calls this injects those filters based on the user's selections. If no rating type is specified, it substitutes "AND ratingTypeId = 1" for the AND-clause comments in all three parts of the query listed below. All ratings are integers from 1 to 5, though that isn't really important for this query.
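For illustration, a fully populated injected clause might look like the following (every ID value here is hypothetical; the real ones come from the user's selections):

AND ratingTypeId = 1   -- default "overall" type when none is selected
AND ageId = 3          -- hypothetical age ID
AND genderId = 2       -- hypothetical gender ID
AND productId = 12345  -- hypothetical product ID
AND companyId = 42     -- hypothetical company ID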
Here are the tables I'm working with:
CREATE TABLE `times` (
`timeId` int(11) NOT NULL AUTO_INCREMENT,
`date` date NOT NULL,
`month` char(7) NOT NULL,
`quarter` char(7) NOT NULL,
`year` char(4) NOT NULL,
PRIMARY KEY (`timeId`),
UNIQUE KEY `date` (`date`)
) ENGINE=MyISAM;

CREATE TABLE `reviewCount` (
`companyId` int(11) NOT NULL,
`productId` int(11) NOT NULL,
`createdOnTimeId` int(11) NOT NULL,
`ageId` int(11) NOT NULL,
`genderId` int(11) NOT NULL,
`totalReviews` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`companyId`,`productId`,`createdOnTimeId`,`ageId`,`genderId`),
KEY `companyId_fk` (`companyId`),
KEY `productId_fk` (`productId`),
KEY `createdOnTimeId` (`createdOnTimeId`),
KEY `ageId_fk` (`ageId`),
KEY `genderId_fk` (`genderId`)
) ENGINE=MyISAM;

CREATE TABLE `ratingCount` (
`companyId` int(11) NOT NULL,
`productId` int(11) NOT NULL,
`createdOnTimeId` int(11) NOT NULL,
`ageId` int(11) NOT NULL,
`genderId` int(11) NOT NULL,
`ratingTypeId` int(11) NOT NULL,
`negativeRatings` int(10) unsigned NOT NULL DEFAULT '0',
`positiveRatings` int(10) unsigned NOT NULL DEFAULT '0',
`neutralRatings` int(10) unsigned NOT NULL DEFAULT '0',
`totalRatings` int(10) unsigned NOT NULL DEFAULT '0',
`ratingsSum` double unsigned DEFAULT '0',
`totalRecommendations` int(10) unsigned NOT NULL DEFAULT '0',
PRIMARY KEY (`companyId`,`productId`,`createdOnTimeId`,`ageId`,`genderId`,`ratingTypeId`),
KEY `companyId_fk` (`companyId`),
KEY `productId_fk` (`productId`),
KEY `createdOnTimeId` (`createdOnTimeId`),
KEY `ageId_fk` (`ageId`),
KEY `genderId_fk` (`genderId`),
KEY `ratingTypeId_fk` (`ratingTypeId`)
) ENGINE=MyISAM;
The 'times' table is pre-populated with every day from 1900-01-01 through 2049-12-31, and the two count tables are populated by an ETL script with a summary query grouped by company, product, age, gender, rating type, etc...
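For context, that aggregation step is shaped roughly like this (a minimal sketch only; the source table "reviews" and its columns are assumptions, not our actual schema):

INSERT INTO reviewCount
  (companyId, productId, createdOnTimeId, ageId, genderId, totalReviews)
SELECT
  companyId, productId, createdOnTimeId, ageId, genderId, COUNT(*)
FROM reviews  -- hypothetical raw review table
GROUP BY companyId, productId, createdOnTimeId, ageId, genderId;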
What I expect from the query is something like this:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 7020 10586 4.017514595496247
2013-01-25 5505 16091 4.058400718778077
2013-01-27 2043 18134 3.992957746478873
2013-01-28 3280 21414 3.983625730994152
2013-01-29 4648 26062 3.921597633136095
...
2013-03-09 1608 60297 3.9409722222222223
2013-03-10 470 60767 3.7743682310469313
2013-03-11 1028 61795 4.036697247706422
2013-03-13 494 62289 3.857388316151203
2013-03-14 449 62738 3.8282208588957056
I'm pretty sure I could pre-calculate everything grouped by age, gender, etc... except for the averages, though I may be wrong about that. If I had 3 reviews on one day for two products, all other groupings being different, with ratings of 2 and 5 for one product and 4 for the other, the daily rating average would be 3.5 for the first and 4 for the second. Averaging those gives me 3.75, when I'd expect 3.66667. Maybe I could do something like multiply each grouping's average by its number of ratings to get that day's rating sum per group, add those together, and divide by the total rating count at the end (there's a sketch of this idea at the end of the question). It seems like a lot of extra work, but it might be faster than what I'm doing now. Speaking of which, here's my current query:
SET @cumulativeCount :=
(SELECT coalesce(sum(rc.totalReviews), 0)
FROM reviewCount rc
INNER JOIN times dt ON rc.createdOnTimeId = dt.timeId
WHERE dt.date < @StartDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
);
SET @dailyAverageWithCarry :=
(SELECT SUM(rc.ratingsSum) / SUM(rc.totalRatings)
FROM ratingCount rc
INNER JOIN times dt ON rc.createdOnTimeId = dt.timeId
WHERE dt.date < @StartDate
AND rc.totalRatings > 0
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY dt.timeId
ORDER BY dt.date DESC LIMIT 1
);
SELECT
subquery.d AS `Date`,
subquery.newReviewsCount AS `NewReviews`,
(@cumulativeCount := @cumulativeCount + subquery.newReviewsCount) AS `CumulativeReviewsCount`,
(@dailyAverageWithCarry := COALESCE(subquery.dailyRatingAverage, @dailyAverageWithCarry)) AS `DailyRatingAverage`
FROM
(
SELECT
dt.date AS d,
COALESCE(SUM(rc.totalReviews), 0) AS newReviewsCount,
SUM(rac.ratingsSum) / SUM(rac.totalRatings) AS dailyRatingAverage
FROM times dt
LEFT JOIN reviewCount rc ON dt.timeId = rc.createdOnTimeId
LEFT JOIN ratingCount rac ON dt.timeId = rac.createdOnTimeId
WHERE dt.date BETWEEN @StartDate AND @EndDate
-- AND clause for filtering by ratingType (default 1), age, gender, product, and company is injected here in C#
GROUP BY dt.timeId
ORDER BY dt.timeId
) AS subquery;
That query currently takes about 2 minutes to run, with row counts as follows:
times 54787
reviewCount 276389
ratingCount 473683
age 122
gender 3
ratingType 28
product 70070
Any help would be hugely appreciated. I'd like to get this query faster, or, if it would be faster, to pre-compute the values grouped by date, age, gender, product, company, and ratingType, and then run a quick summation query against that table.
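As for the sketch promised above: the average-of-averages problem disappears if each pre-computed group carries its rating sum and rating count (as ratingCount already does), because the true daily mean can then be recovered by re-weighting. Against a hypothetical pre-aggregated table it would look something like this:

-- dailyRatings(createdOnTimeId, avgRating, ratingCount) is hypothetical.
-- Multiplying each group's average by its count recovers that group's rating
-- sum, so the result is the true mean from the example above:
-- (3.5*2 + 4*1) / (2+1) = 11/3 = 3.66667, not the naive (3.5 + 4) / 2 = 3.75.
SELECT
  createdOnTimeId,
  SUM(avgRating * ratingCount) / SUM(ratingCount) AS dailyRatingAverage
FROM dailyRatings
GROUP BY createdOnTimeId;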
Update #1: I tried Meherzad's suggestion of adding indexes to times and ratingCount:
ALTER TABLE times ADD KEY `timeId_date_key` (`timeId`, `date`);
ALTER TABLE ratingCount ADD KEY `createdOnTimeId_totalRatings_key` (`createdOnTimeId`, `totalRatings`);
and then ran my initial query again. It was about 1 second faster (~89 seconds), but that's still far too slow. I tried the query Meherzad suggested and had to kill it after a few minutes.

As requested, here are the EXPLAIN results for my query:
id|select_type|table|type|possible_keys|key|key_len|ref|rows|Extra
1|PRIMARY|<derived2>|ALL|NULL|NULL|NULL|NULL|6808032|NULL
2|DERIVED|dt|range|PRIMARY,timeId_date_key,date|date|3|NULL|88|Using index condition; Using temporary; Using filesort
2|DERIVED|rc|ref|PRIMARY,companyId_fk,createdOnTimeId|createdOnTimeId|4|dt.timeId|126|Using where
2|DERIVED|rac|ref|createdOnTimeId,createdOnTimeId_totalRatings_key|createdOnTimeId|4|dt.timeId|614|NULL
I checked the key cache read miss rate mentioned in the article about buffer sizes, and it was:

Key_reads 58303
Key_read_requests 147411279

for a miss rate of about 3.955e-4 (Key_reads / Key_read_requests).
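(Those counters come straight from the key cache status variables:)

SHOW GLOBAL STATUS LIKE 'Key_read%';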
Update #2: Solved! The indexes definitely helped, so I'll give the answer credit to Meherzad. What actually made the biggest difference was realizing that calculating the rolling average and the daily/cumulative review counts in the same query meant joining the two huge tables against each other. I noticed that the variable initialization was done in two separate queries, so I decided to try splitting the two big queries into subqueries and then joining them on timeId. It now runs in 0.358 seconds, with the following query:
SET @StartDate = '2013-01-24';
SET @EndDate = '2013-04-24';
SELECT
@StartDateId:=MIN(timeId), @EndDateId:=MAX(timeId)
FROM
times
WHERE
date IN (@StartDate , @EndDate);
SELECT
@CumulativeCount:=COALESCE(SUM(totalReviews), 0)
FROM
reviewCount
WHERE
createdOnTimeId < @StartDateId
-- Add Filters
;
SELECT
@DailyAverage:=COALESCE(SUM(ratingsSum) / SUM(totalRatings), 0)
FROM
ratingCount
WHERE
createdOnTimeId < @StartDateId
AND totalRatings > 0
-- Add Filters
GROUP BY createdOnTimeId
ORDER BY createdOnTimeId DESC
LIMIT 1;
SELECT
t.date AS `Date`,
COALESCE(q1.newReviewsCount, 0) AS `NewReviews`,
(@CumulativeCount:=@CumulativeCount + COALESCE(q1.newReviewsCount, 0)) AS `CumulativeReviewsCount`,
(@DailyAverage:=COALESCE(q2.dailyRatingAverage,
COALESCE(@DailyAverage, 0))) AS `DailyRatingAverage`
FROM
times t
LEFT JOIN
(SELECT
rc.createdOnTimeId AS createdOnTimeId,
COALESCE(SUM(rc.totalReviews), 0) AS newReviewsCount
FROM
reviewCount rc
WHERE
rc.createdOnTimeId BETWEEN @StartDateId AND @EndDateId
-- Add Filters
GROUP BY rc.createdOnTimeId) AS q1 ON t.timeId = q1.createdOnTimeId
LEFT JOIN
(SELECT
rc.createdOnTimeId AS createdOnTimeId,
SUM(rc.ratingsSum) / SUM(rc.totalRatings) AS dailyRatingAverage
FROM
ratingCount rc
WHERE
rc.createdOnTimeId BETWEEN @StartDateId AND @EndDateId
-- Add Filters
GROUP BY rc.createdOnTimeId) AS q2 ON t.timeId = q2.createdOnTimeId
WHERE
t.timeId BETWEEN @StartDateId AND @EndDateId;
I had assumed the two subqueries would be terribly slow, but they were extremely fast because they weren't joining masses of completely unrelated rows. It also exposed the fact that my earlier results were way off. For example, from above:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 7020 10586 4.017514595496247
should have been, and now is:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 599 407327 4.017514595496247
The averages were correct, but the joins were messing up the new and cumulative review counts, which I verified with a single standalone query.

I also got rid of the join against the times table, instead determining the start and end date IDs in a quick initialization query and joining back to the times table only at the end.

The results are now:
Date NewReviews CumulativeReviewsCount DailyRatingAverage
2013-01-24 599 407327 4.017514595496247
2013-01-25 551 407878 4.058400718778077
2013-01-26 455 408333 3.838926174496644
2013-01-27 433 408766 3.992957746478873
2013-01-28 425 409191 3.983625730994152
...
2013-04-13 170 426066 3.874239350912779
2013-04-14 182 426248 3.585714285714286
2013-04-15 171 426419 3.6202531645569622
2013-04-16 0 426419 3.6202531645569622
2013-04-17 0 426419 3.6202531645569622
2013-04-18 0 426419 3.6202531645569622
2013-04-19 0 426419 3.6202531645569622
2013-04-20 0 426419 3.6202531645569622
2013-04-21 0 426419 3.6202531645569622
2013-04-22 0 426419 3.6202531645569622
2013-04-23 0 426419 3.6202531645569622
2013-04-24 0 426419 3.6202531645569622
The last several averages also correctly carry the earlier value forward, since we hadn't imported data from that client's feed for about 10 days.

Thanks for the help!