设置
在双核 2GHz + 2GB RAM 机器上运行 SQL Server 7 时,我遇到了性能和概念上的问题 -正如您所料:-/。
情况
我正在使用一个遗留数据库,我需要挖掘数据以获得各种见解。我有all_stats
一张表,其中包含特定上下文中事物的所有统计数据。group_contexts
这些上下文在表格的帮助下进行分组。一个简化的模式:
+--------------------------------------------------------------------+
| thingies |
+--------------------------------------------------------------------|
| id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| all_stats |
+--------------------------------------------------------------------+
| id | INT PRIMARY KEY IDENTITY(1,1) |
| context_id | INT FOREIGN KEY REFERENCES contexts(id) |
| value | FLOAT NULL |
| some_date | DATETIME NOT NULL |
| thingy_id | INT NOT NULL FOREIGN KEY REFERENCES thingies(id) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| group_contexts |
+--------------------------------------------------------------------|
| id | INT PRIMARY KEY IDENTITY(1,1) |
| group_id | INT NOT NULL FOREIGN KEY REFERENCES groups(group_id) |
| context_id | INT NOT NULL FOREIGN KEY REFERENCES contexts(id) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| contexts |
+--------------------------------------------------------------------+
| id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
+--------------------------------------------------------------------+
| groups |
+--------------------------------------------------------------------+
| group_id | INT PRIMARY KEY IDENTITY(1,1) |
+--------------------------------------------------------------------+
问题
任务是,对于给定的事物集,为事物all_stats.some_date
具有统计信息的所有组查找并汇总事物的 3 个最新 ( ) 统计信息。我知道这听起来很容易,但我不知道如何在 SQL 中正确地做到这一点——我并不完全是一个神童。
我的糟糕解决方案(不,这真的很糟糕......)
我现在的解决方案是用所有必需UNION ALL
的数据和我需要的数据填充一个临时表:
-- Before I'm building this SQL I retrieve the relevant groups
-- for being able to build the `UNION ALL`s at the bottom.
-- I also retrieve the thingies that are relevant in this context
-- beforehand and include their ids as a comma separated list -
-- I said it would be awfull ...
-- Creating the temp table holding all stats data rows
-- for a thingy in a specific group
CREATE TABLE #stats
(id INT PRIMARY KEY IDENTITY(1,1),
group_id INT NOT NULL,
thingy_id INT NOT NULL,
value FLOAT NOT NULL,
some_date DATETIME NOT NULL)
-- Filling the temp table
INSERT INTO #stats(group_id,thingy_id,value,some_date)
SELECT filtered.group_id, filtered.thingy_id, filtered.some_date, filtered.value
FROM
(SELECT joined.group_id,joined.thingy_id,joined.value,joined.some_date
FROM
(SELECT groups.group_id,data.value,data.thingy_id,data.some_date
FROM
-- Getting the groups associated with the contexts
-- of all the stats available
(SELECT DISTINCT context.group_id
FROM all_stats AS stat
INNER JOIN group_contexts AS groupcontext
ON groupcontext.context_id = stat.context_id
) AS groups
INNER JOIN
-- Joining the available groups with the actual
-- stat data of the group for a thingy
(SELECT context.group_id,stat.value,stat.some_date,stat.thingy_id
FROM all_stats AS stat
INNER JOIN group_contexts AS groupcontext
ON groupcontext.context_id = stat.context_id
WHERE stat.value IS NOT NULL
AND stat.value >= 0) AS data
ON data.group_id = groups.group_id) AS joined
) AS filtered
-- I already have the thingies beforehand but if it would be possible
-- to include/query for them in another way that'd be OK by me
WHERE filtered.thingy_id in (/* somewhere around 10000 thingies are available */)
-- Now I'm building the `UNION ALL`s for each thingy as well as
-- the group the stat of the thingy belongs to
-- thingy 42 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 982
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
FROM #stats AS s
WHERE s.group_id = 982
AND s.thingy_id = 42
ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
UNION ALL
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 314159
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
(SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
FROM #stats AS s
WHERE s.group_id = 314159
AND s.thingy_id = 42
ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
-- }
UNION ALL
-- thingy 21 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 21 in group 982
/* you get the idea */
这工作 - 缓慢,但它工作 - 对于小型数据集(例如,说 100 个事物,每个事物都有 10 个统计信息),但它最终必须工作的问题域是 10000 多个事物,每个事物可能有数百个统计数据。附带说明:生成的 SQL 查询非常大:一个非常小的查询涉及 350 个事物,这些事物在 3 个上下文组中具有数据,总计超过 250 000 条格式化的 SQL 行——在惊人的 5 分钟内执行。
因此,如果有人知道如何解决这个问题,我真的非常感谢您的帮助:-)。