Setup

I'm running into performance and conceptual problems with SQL Server 7 on a dual-core 2 GHz machine with 2 GB of RAM (as you might expect :-/).

Situation

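I'm working with a legacy database that I need to mine for various insights. There is an all_stats table that holds all the stats for a thingy in a specific context. These contexts are grouped with the help of the group_contexts table. A simplified schema: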

+--------------------------------------------------------------------+
| thingies                                                           |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| all_stats                                                          |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
| context_id  | INT FOREIGN KEY REFERENCES contexts(id)              |
| value       | FLOAT NULL                                           |
| some_date   | DATETIME NOT NULL                                    |
| thingy_id   | INT NOT NULL FOREIGN KEY REFERENCES thingies(id)     |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| group_contexts                                                     |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
| group_id    | INT NOT NULL FOREIGN KEY REFERENCES groups(group_id) |
| context_id  | INT NOT NULL FOREIGN KEY REFERENCES contexts(id)     |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| contexts                                                           |
+--------------------------------------------------------------------+
| id          | INT PRIMARY KEY IDENTITY(1,1)                        |
+--------------------------------------------------------------------+

+--------------------------------------------------------------------+
| groups                                                             |
+--------------------------------------------------------------------+
| group_id    | INT PRIMARY KEY IDENTITY(1,1)                        |
+--------------------------------------------------------------------+

Problem

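The task is, for a given set of thingies, to find and aggregate the 3 most recent (by all_stats.some_date) stats of each thingy for every group the thingy has stats for. I know it sounds easy, but I can't work out how to do this properly in SQL; I'm not exactly a prodigy.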

My bad solution (no, it's really bad...)

My current solution is to fill a temporary table with all the required data and then UNION ALL the queries I need:

-- Before building this SQL I retrieve the relevant groups
-- so that I can build the `UNION ALL`s at the bottom.
-- I also retrieve the thingies that are relevant in this context
-- beforehand and include their ids as a comma-separated list -
-- I said it would be awful ...

-- Creating the temp table holding all stats data rows
-- for a thingy in a specific group
CREATE TABLE #stats
(id INT PRIMARY KEY IDENTITY(1,1),
 group_id INT NOT NULL,
 thingy_id INT NOT NULL,
 value FLOAT NOT NULL,
 some_date DATETIME NOT NULL)

-- Filling the temp table
INSERT INTO #stats(group_id,thingy_id,value,some_date)
SELECT filtered.group_id, filtered.thingy_id, filtered.value, filtered.some_date
FROM
   (SELECT joined.group_id,joined.thingy_id,joined.value,joined.some_date
    FROM
       (SELECT groups.group_id,data.value,data.thingy_id,data.some_date
        FROM
            -- Getting the groups associated with the contexts
            -- of all the stats available
           (SELECT DISTINCT groupcontext.group_id
            FROM all_stats AS stat
            INNER JOIN group_contexts AS groupcontext
                ON groupcontext.context_id = stat.context_id
        ) AS groups
        INNER JOIN
            -- Joining the available groups with the actual
            -- stat data of the group for a thingy
           (SELECT groupcontext.group_id,stat.value,stat.some_date,stat.thingy_id
            FROM all_stats AS stat
            INNER JOIN group_contexts AS groupcontext
                ON groupcontext.context_id = stat.context_id
            WHERE stat.value IS NOT NULL
              AND stat.value >= 0) AS data
        ON data.group_id = groups.group_id) AS joined
    ) AS filtered
-- I already have the thingies beforehand but if it would be possible
-- to include/query for them in another way that'd be OK by me
WHERE filtered.thingy_id in (/* somewhere around 10000 thingies are available */)

-- Now I'm building the `UNION ALL`s for each thingy as well as
-- the group the stat of the thingy belongs to

-- thingy 42 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 982
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
   (SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
    FROM #stats AS s
    WHERE s.group_id = 982
      AND s.thingy_id = 42
    ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3

UNION ALL

-- Getting the average of the most recent 3 stat items
-- for a thingy with id 42 in group 314159
SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
   (SELECT TOP 3 s.group_id,s.thingy_id,s.value,s.some_date
    FROM #stats AS s
    WHERE s.group_id = 314159
      AND s.thingy_id = 42
    ORDER BY s.some_date DESC) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3
-- }

UNION ALL

-- thingy 21 {
-- Getting the average of the most recent 3 stat items
-- for a thingy with id 21 in group 982
/* you get the idea */

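This works (slowly, but it works) for small data sets (say, 100 thingies with 10 stats each), but the problem domain it ultimately has to handle is 10,000+ thingies with potentially hundreds of stats per thingy. As a side note, the generated SQL query is ridiculously large: a pretty small query involving 350 thingies that have data in 3 context groups amounts to more than 250,000 formatted lines of SQL, executing in a stunning 5 minutes.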

So if anyone has an idea how to solve this, I would really, really appreciate your help :-).

1 Answer

On your ancient version of SQL Server you need to use an old-style scalar subquery to get the last three rows for all thingies in a single query :-)

SELECT x.group_id,x.thingy_id,AVG(x.value)
FROM
 (
   SELECT s.group_id,s.thingy_id,s.value
   FROM #stats AS s
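   -- Keep a row only if at most three rows of the same group/thingy
   -- are at least as recent, i.e. it is among the three newest
   WHERE (SELECT COUNT(*) FROM #stats AS s2
          WHERE s.group_id = s2.group_id
            AND s.thingy_id = s2.thingy_id
            AND s.some_date <= s2.some_date
         ) <= 3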
 ) AS x
GROUP BY x.group_id,x.thingy_id
HAVING COUNT(*) >= 3

For better performance you will probably want to add a clustered index on (group_id, thingy_id, some_date desc, value) to the #stats table.
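A minimal sketch of that index, assuming the #stats table from the question. A PRIMARY KEY takes the clustered slot by default, so the sketch declares the key on id as NONCLUSTERED to leave room for the new index. The index name ix_stats_group_thingy_date is made up for illustration. A version as old as SQL Server 7 may not accept DESC in an index key; if it is rejected, drop the keyword, since an ascending key should serve the same equality-plus-range lookup.

```sql
-- Sketch only: same columns as the question's #stats, but the primary key
-- is NONCLUSTERED so the clustered index can go on the lookup columns.
CREATE TABLE #stats
(id INT IDENTITY(1,1) PRIMARY KEY NONCLUSTERED,
 group_id INT NOT NULL,
 thingy_id INT NOT NULL,
 value FLOAT NOT NULL,
 some_date DATETIME NOT NULL)

-- Hypothetical index name; the key order matches the correlated subquery:
-- equality on group_id and thingy_id, then a range on some_date.
-- Remove DESC if this server version does not support descending keys.
CREATE CLUSTERED INDEX ix_stats_group_thingy_date
    ON #stats (group_id, thingy_id, some_date DESC, value)
```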

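If (group_id, thingy_id, some_date) is unique you should remove the useless ID column; otherwise, ORDER BY group_id, thingy_id, some_date DESC during the INSERT/SELECT into #stats and use ID instead of some_date to find the last three rows.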
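The second option could look roughly like this. It is a sketch that assumes IDENTITY values are assigned in the order given by the ORDER BY of the INSERT ... SELECT, which is what the suggestion relies on. #source is a hypothetical stand-in for the rows the question's filtered subquery selects.

```sql
-- Fill #stats so that, within each (group_id, thingy_id),
-- newer rows receive smaller id values.
INSERT INTO #stats (group_id, thingy_id, value, some_date)
SELECT src.group_id, src.thingy_id, src.value, src.some_date
FROM #source AS src   -- hypothetical: the rows selected in the question
ORDER BY src.group_id, src.thingy_id, src.some_date DESC

-- Rank by id instead of some_date. Ids are unique, so rows that tie on
-- some_date can no longer push a group/thingy below three qualifying rows.
SELECT x.group_id, x.thingy_id, AVG(x.value) AS avg_value
FROM
 (
   SELECT s.group_id, s.thingy_id, s.value
   FROM #stats AS s
   WHERE (SELECT COUNT(*) FROM #stats AS s2
          WHERE s2.group_id = s.group_id
            AND s2.thingy_id = s.thingy_id
            AND s2.id <= s.id
         ) <= 3
 ) AS x
GROUP BY x.group_id, x.thingy_id
HAVING COUNT(*) >= 3
```

With date-based ranking, two rows sharing the third-newest some_date both count four rows at least as recent and are both filtered out, so the group/thingy then fails the HAVING clause. Ranking on the unique id avoids that.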

Answered 2015-03-22T10:47:18.050