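Clearly there are many different ways of getting the same result; your question seems to be about an efficient way of getting the last row in each group in MySQL. If you are working with huge amounts of data and using InnoDB, then even with the latest versions of MySQL (such as 5.7.21 and 8.0.4-rc) there might not be an efficient way of doing this.
We sometimes need to do this on tables with more than 60 million rows.
For these examples I will use data with only about 1.5 million rows, where the queries need to find results for all groups in the data. In our actual cases we would often need to return data for about 2,000 groups (which, hypothetically, should not require examining very much of the data).
I will use the following tables: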
CREATE TABLE temperature(
id INT UNSIGNED NOT NULL AUTO_INCREMENT,
groupID INT UNSIGNED NOT NULL,
recordedTimestamp TIMESTAMP NOT NULL,
recordedValue INT NOT NULL,
INDEX groupIndex(groupID, recordedTimestamp),
PRIMARY KEY (id)
);
CREATE TEMPORARY TABLE selected_group(id INT UNSIGNED NOT NULL, PRIMARY KEY(id));
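The temperature table is populated with about 1.5 million random records across 100 different groups. The selected_group table is populated with those 100 groups (in our real cases this would normally be fewer than 20% of all groups).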
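As a sketch of that last step, selected_group could be filled with every group present in temperature. This is only one possible way to do it; in a real case the statement would select just the groups of interest:

```sql
-- Copy every distinct group from temperature into selected_group.
-- Assumes selected_group is still empty.
INSERT INTO selected_group (id)
SELECT DISTINCT groupID
FROM temperature;
```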
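Because the data is random, multiple rows can have the same recordedTimestamp. What we want is a list of all the selected groups, in groupID order, with the last recordedTimestamp for each group; if a group has more than one row with that timestamp, we want the one with the highest id.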
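To make the tie-breaking rule concrete, here is a minimal sketch with a handful of made-up rows. The values are purely illustrative and not part of the test data, and the statements assume freshly created, empty tables:

```sql
-- Illustrative rows only; ids are given explicitly so the tie is easy to see.
INSERT INTO temperature (id, groupID, recordedTimestamp, recordedValue) VALUES
    (1, 1, '2018-01-01 10:00:00', 20),
    (2, 1, '2018-01-01 11:00:00', 21),
    (3, 1, '2018-01-01 11:00:00', 22),  -- same timestamp as id 2
    (4, 2, '2018-01-01 09:00:00', 15),
    (5, 3, '2018-01-01 12:00:00', 30);  -- group 3 is not selected

INSERT INTO selected_group (id) VALUES (1), (2);

-- Rows the queries should return, in groupID order:
--   id = 3, groupID = 1, recordedTimestamp = 2018-01-01 11:00:00, recordedValue = 22
--   id = 4, groupID = 2, recordedTimestamp = 2018-01-01 09:00:00, recordedValue = 15
```

Group 1 has two rows sharing the latest timestamp, so the row with the higher id (3) is the one wanted; group 3 is absent from the result because it is not in selected_group.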
If, hypothetically, MySQL had a last() function that returned values from the last row according to a special ORDER BY clause, we could simply do:
SELECT
last(t1.id) AS id,
t1.groupID,
last(t1.recordedTimestamp) AS recordedTimestamp,
last(t1.recordedValue) AS recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
ORDER BY t1.recordedTimestamp, t1.id
GROUP BY t1.groupID;
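In this case that would need to examine only about 100 rows, as it does not use any of the normal GROUP BY functions. It would execute in 0 seconds and hence be highly efficient. Note that in MySQL the ORDER BY clause normally follows the GROUP BY clause; here, however, the ORDER BY clause determines the row order for the last() function, whereas if it came after the GROUP BY it would be ordering the groups. If no GROUP BY clause were present, the last() values would be the same in every returned row.
However, MySQL does not have this, so let's look at different ideas using what it does have, and show that none of them is efficient.
Example 1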
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
SELECT t2.id
FROM temperature t2
WHERE t2.groupID = g.id
ORDER BY t2.recordedTimestamp DESC, t2.id DESC
LIMIT 1
);
This examined 3,009,254 rows and took about 0.859 seconds on 5.7.21, and slightly longer on 8.0.4-rc.
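If you want to check row counts like these on your own data, one possible approach (a sketch, and not necessarily how the figures here were produced) is to reset the session status counters, run the query, and then read the Handler_read% counters, which count row reads at the storage engine level:

```sql
-- Reset the session status counters (FLUSH STATUS needs the RELOAD privilege).
FLUSH STATUS;

-- Run the query under test here, for instance the query from Example 1.

-- Row reads performed by this session since the reset.
SHOW SESSION STATUS LIKE 'Handler_read%';
```

The individual counters would need to be summed, and the total may not match the figures quoted here exactly.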
Example 2
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM temperature t1
INNER JOIN (
SELECT max(t2.id) AS id
FROM temperature t2
INNER JOIN (
SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
FROM selected_group g
INNER JOIN temperature t3 ON t3.groupID = g.id
GROUP BY t3.groupID
) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
GROUP BY t2.groupID
) t5 ON t5.id = t1.id;
This examined 1,505,331 rows and took about 1.25 seconds on 5.7.21, and slightly longer on 8.0.4-rc.
Example 3
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM temperature t1
WHERE t1.id IN (
SELECT max(t2.id) AS id
FROM temperature t2
INNER JOIN (
SELECT t3.groupID, max(t3.recordedTimestamp) AS recordedTimestamp
FROM selected_group g
INNER JOIN temperature t3 ON t3.groupID = g.id
GROUP BY t3.groupID
) t4 ON t4.groupID = t2.groupID AND t4.recordedTimestamp = t2.recordedTimestamp
GROUP BY t2.groupID
)
ORDER BY t1.groupID;
This examined 3,009,685 rows and took about 1.95 seconds on 5.7.21, and slightly longer on 8.0.4-rc.
Example 4
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.id = (
SELECT max(t2.id)
FROM temperature t2
WHERE t2.groupID = g.id AND t2.recordedTimestamp = (
SELECT max(t3.recordedTimestamp)
FROM temperature t3
WHERE t3.groupID = g.id
)
);
This examined 6,137,810 rows and took about 2.2 seconds on 5.7.21, and slightly longer on 8.0.4-rc.
Example 5
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
SELECT
t2.id,
t2.groupID,
t2.recordedTimestamp,
t2.recordedValue,
row_number() OVER (
PARTITION BY t2.groupID ORDER BY t2.recordedTimestamp DESC, t2.id DESC
) AS rowNumber
FROM selected_group g
INNER JOIN temperature t2 ON t2.groupID = g.id
) t1 WHERE t1.rowNumber = 1;
This examined 6,017,808 rows and took about 4.2 seconds on 8.0.4-rc.
Example 6
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM (
SELECT
last_value(t2.id) OVER w AS id,
t2.groupID,
last_value(t2.recordedTimestamp) OVER w AS recordedTimestamp,
last_value(t2.recordedValue) OVER w AS recordedValue
FROM selected_group g
INNER JOIN temperature t2 ON t2.groupID = g.id
WINDOW w AS (
PARTITION BY t2.groupID
ORDER BY t2.recordedTimestamp, t2.id
RANGE BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING
)
) t1
GROUP BY t1.groupID;
This examined 6,017,908 rows and took about 17.5 seconds on 8.0.4-rc.
Example 7
SELECT t1.id, t1.groupID, t1.recordedTimestamp, t1.recordedValue
FROM selected_group g
INNER JOIN temperature t1 ON t1.groupID = g.id
LEFT JOIN temperature t2
ON t2.groupID = g.id
AND (
t2.recordedTimestamp > t1.recordedTimestamp
OR (t2.recordedTimestamp = t1.recordedTimestamp AND t2.id > t1.id)
)
WHERE t2.id IS NULL
ORDER BY t1.groupID;
This one took forever, so I had to kill it.