
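I am using SQL Server to store data about ticket validations. A single ticket can be validated at multiple places. I need to group records by "entry" and "exit" place and calculate statistics on the time that passed between the two validations. Here is the table (simplified for clarity):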

CREATE TABLE TestDuration
(VALIDATION_TIMESTAMP datetime, 
ID_TICKET bigint, 
ID_PLACE bigint)

And the data:

INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 19:24:05.700', 1, 1)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-25 20:08:04.250', 2, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:13.040', 3, 3)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:20.990', 1, 2)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:18:29.290', 2, 4)
INSERT INTO TestDuration(VALIDATION_TIMESTAMP,ID_TICKET,ID_PLACE) VALUES ('2012-07-26 10:25:37.040', 1, 4)

Here is the aggregate query:

SELECT VisitDurationCalcTable.ID_PLACE AS ID_PLACE_IN, 
VisitDurationCalcTable.ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT, 
COUNT(visitduration) AS NUMBER_OF_VISITS, AVG(visitduration) AS AVERAGE_VISIT_DURATION 
FROM (
      SELECT EntryData.VALIDATION_TIMESTAMP, EntryData.ID_TICKET, EntryData.ID_PLACE, 
      (
       SELECT TOP 1 ID_PLACE FROM TestDuration 
          WHERE ID_TICKET=EntryData.ID_TICKET 
          AND VALIDATION_TIMESTAMP>EntryData.VALIDATION_TIMESTAMP 
          ORDER BY VALIDATION_TIMESTAMP ASC
      ) 
      AS ID_NEXT_VISIT_PLACE, 
      DATEDIFF(n,EntryData.VALIDATION_TIMESTAMP,
               (
                SELECT TOP 1 VALIDATION_TIMESTAMP FROM TestDuration WHERE ID_TICKET=EntryData.ID_TICKET and VALIDATION_TIMESTAMP>EntryData.VALIDATION_TIMESTAMP ORDER BY VALIDATION_TIMESTAMP ASC
               )
              ) AS visitduration 
     FROM TestDuration EntryData)
AS VisitDurationCalcTable 
WHERE VisitDurationCalcTable.ID_NEXT_VISIT_PLACE IS NOT NULL
GROUP BY VisitDurationCalcTable.ID_PLACE, VisitDurationCalcTable.ID_NEXT_VISIT_PLACE

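The query works, but I ran into performance problems very quickly. On a table of 40K rows it takes about 3 minutes to execute. I am not a SQL expert, so I can't really see how to rewrite the query to run faster. This is not a critical report, and it is produced only about once a month, but it makes my application look bad. I have a feeling I am missing something simple here.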

1 Answer

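TL;DR version

You are clearly missing an index that would help this query. Adding the missing index will likely yield an order-of-magnitude improvement on its own.

If you are on SQL Server 2012, rewriting the query using LEAD would also do it (though that too will benefit from the missing index).

If you are still on SQL Server 2005/2008, you can make some improvements to the existing query, but the effect will be relatively minor compared to the index change.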

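Longer version

For this to take 3 minutes, I assume you have no useful indexes at all, and the biggest win will be simply adding an index. For a report that runs once a month, copying the data from the three columns into an appropriately indexed #temp table may suffice if you don't want to create a permanent index.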
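The #temp table option can be sketched roughly as follows. The names #TestDurationTmp and IX_Tmp are hypothetical, and the key order (ID_TICKET, VALIDATION_TIMESTAMP) is an assumption that mirrors the index recommended later in this answer. Because the index is clustered, ID_PLACE is carried in the index rows without an INCLUDE clause.

```sql
-- Copy only the three columns the report needs into a session-scoped temp table.
-- #TestDurationTmp and IX_Tmp are hypothetical names used for illustration.
SELECT VALIDATION_TIMESTAMP,
       ID_TICKET,
       ID_PLACE
INTO   #TestDurationTmp
FROM   TestDuration;

-- Cluster on (ticket, timestamp) so "next validation for this ticket"
-- can be answered by a seek instead of a sort.
CREATE CLUSTERED INDEX IX_Tmp
  ON #TestDurationTmp (ID_TICKET, VALIDATION_TIMESTAMP);

-- ... run the report query against #TestDurationTmp instead of TestDuration ...

-- Clean up once the report has been produced.
DROP TABLE #TestDurationTmp;
```

The temp table disappears with the session anyway, so the explicit DROP only matters if the connection is reused.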

You say the table was simplified for clarity and that it has 40K rows. Assuming the following test data:

CREATE TABLE TestDuration
  (
     Id                   UNIQUEIDENTIFIER DEFAULT NEWID() PRIMARY KEY,
     VALIDATION_TIMESTAMP DATETIME,
     ID_TICKET            BIGINT,
     ID_PLACE             BIGINT,
     OtherColumns         CHAR(100) NULL
  )

INSERT INTO TestDuration
            (VALIDATION_TIMESTAMP,
             ID_TICKET,
             ID_PLACE)
SELECT TOP 40000 DATEADD(minute, ROW_NUMBER() OVER (ORDER BY (SELECT 0)), GETDATE()),
                 ABS(CHECKSUM(NEWID())) % 10,
                 ABS(CHECKSUM(NEWID())) % 100
FROM   master..spt_values v1,
       master..spt_values v2 

Your original query took 51 seconds on my machine at MAXDOP 1, with the following IO statistics:

Table 'Worktable'. Scan count 79990, logical reads 1167573, physical reads 0
Table 'TestDuration'. Scan count 3, logical reads 2472, physical reads 0.
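The answer does not say how these figures were collected. One common way to get comparable output in SQL Server Management Studio, shown here as a minimal sketch, is to turn on the session statistics options and pin the query to a single scheduler with a MAXDOP hint. The COUNT(*) query is only a stand-in for the report query.

```sql
-- Report logical/physical reads per table and CPU/elapsed time per statement
-- in the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Stand-in query; substitute the report query being measured.
-- OPTION (MAXDOP 1) forces a serial plan so timings are comparable between runs.
SELECT COUNT(*)
FROM   TestDuration
OPTION (MAXDOP 1);

SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;
```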

[Plan 1: execution plan image]

For each of the 40,000 rows in the table, it is performing two sorts of all rows with a matching ID_TICKET in order to identify the next VALIDATION_TIMESTAMP in sequence.

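Simply adding the index below brings the elapsed time down to 406 ms, a more than 100-fold improvement (the subsequent queries in this answer assume that this index is now in place).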

CREATE NONCLUSTERED INDEX IX
  ON TestDuration(ID_TICKET, VALIDATION_TIMESTAMP)
  INCLUDE (ID_PLACE) 

With the index in place, the 80,000 sort and spool operations in the plan are replaced by index seeks.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 79991, logical reads 255707, physical reads 0

[Plan 2: execution plan image]

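However, it is still doing two seeks for every row. Rewriting with CROSS APPLY allows these to be combined. CROSS APPLY also discards rows that have no later validation, so the IS NOT NULL filter is no longer needed.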

SELECT VisitDurationCalcTable.ID_PLACE            AS ID_PLACE_IN,
       VisitDurationCalcTable.ID_NEXT_VISIT_PLACE AS ID_PLACE_OUT,
       COUNT(visitduration)                       AS NUMBER_OF_VISITS,
       AVG(visitduration)                         AS AVERAGE_VISIT_DURATION
FROM   (SELECT EntryData.VALIDATION_TIMESTAMP,
               EntryData.ID_TICKET,
               EntryData.ID_PLACE,
               CA.ID_PLACE                                                          AS ID_NEXT_VISIT_PLACE,
               DATEDIFF(n, EntryData.VALIDATION_TIMESTAMP, CA.VALIDATION_TIMESTAMP) AS visitduration
        FROM   TestDuration EntryData
               CROSS APPLY (SELECT TOP 1 ID_PLACE,
                                         VALIDATION_TIMESTAMP
                            FROM   TestDuration
                            WHERE  ID_TICKET = EntryData.ID_TICKET
                                   AND VALIDATION_TIMESTAMP > EntryData.VALIDATION_TIMESTAMP
                            ORDER  BY VALIDATION_TIMESTAMP ASC) CA) AS VisitDurationCalcTable
GROUP  BY VisitDurationCalcTable.ID_PLACE,
          VisitDurationCalcTable.ID_NEXT_VISIT_PLACE 

This gave me an elapsed time of 269 ms.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 40001, logical reads 127988, physical reads 0

[Plan 3: execution plan image]

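Although the number of reads is still high, the seeks are all reading pages that have just been read by the scan, so they are all pages in cache. The number of reads can be reduced by using a table variable.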

DECLARE @T TABLE (
  VALIDATION_TIMESTAMP DATETIME,
  ID_TICKET            BIGINT,
  ID_PLACE             BIGINT,
  RN                   INT,
  PRIMARY KEY(ID_TICKET, RN) )

INSERT INTO @T
SELECT VALIDATION_TIMESTAMP,
       ID_TICKET,
       ID_PLACE,
       ROW_NUMBER() OVER (PARTITION BY ID_TICKET ORDER BY VALIDATION_TIMESTAMP) AS RN
FROM   TestDuration

SELECT T1.ID_PLACE                                                        AS ID_PLACE_IN,
       T2.ID_PLACE                                                        AS ID_PLACE_OUT,
       COUNT(*)                                                           AS NUMBER_OF_VISITS,
       AVG(DATEDIFF(n, T1.VALIDATION_TIMESTAMP, T2.VALIDATION_TIMESTAMP)) AS AVERAGE_VISIT_DURATION
FROM   @T T1
       INNER MERGE JOIN @T T2
         ON T1.ID_TICKET = T2.ID_TICKET
            AND T2.RN = T1.RN + 1
GROUP  BY T1.ID_PLACE,
          T2.ID_PLACE 

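For me, however, this slightly increased the elapsed time to 301 ms (43 ms for the insert + 258 ms for the select), but it could still be a good option in lieu of creating a permanent index.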

(Insert)
Table 'TestDuration'. Scan count 1, logical reads 233, physical reads 0

(Select)
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table '#0C50D423'. Scan count 2, logical reads 372, physical reads 0

[Plan: execution plan image]

Finally, if you are using SQL Server 2012, you can use LEAD:

WITH CTE
     AS (SELECT ID_PLACE AS ID_PLACE_IN,
                LEAD(ID_PLACE) OVER (PARTITION BY ID_TICKET 
                                         ORDER BY VALIDATION_TIMESTAMP) AS ID_PLACE_OUT,
                DATEDIFF(n, 
                         VALIDATION_TIMESTAMP, 
                         LEAD(VALIDATION_TIMESTAMP) OVER (PARTITION BY ID_TICKET 
                                                              ORDER BY VALIDATION_TIMESTAMP)) AS VISIT_DURATION
         FROM   TestDuration)
SELECT ID_PLACE_IN,
       ID_PLACE_OUT,
       COUNT(*)            AS NUMBER_OF_VISITS,
       AVG(VISIT_DURATION) AS AVERAGE_VISIT_DURATION
FROM   CTE
WHERE  ID_PLACE_OUT IS NOT NULL
GROUP  BY ID_PLACE_IN,
          ID_PLACE_OUT 

This gave me an elapsed time of 249 ms.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 1, logical reads 233, physical reads 0

[Plan 4: execution plan image]

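The LEAD version also performs well without the index. Omitting the optimal index adds an extra SORT to the plan and means it has to read the wider clustered index on my test table, but it still completed in 293 ms elapsed time.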

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0
Table 'TestDuration'. Scan count 1, logical reads 824, physical reads 0
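As a quick sanity check, the LEAD query can be run against the six sample rows from the question. This is a self-contained sketch: the table variable @Sample is a hypothetical stand-in for TestDuration, the timestamps are written in ISO 8601 form to avoid language-dependent parsing, and an ORDER BY is added only to make the output order deterministic. It requires SQL Server 2012 or later.

```sql
-- @Sample is a hypothetical stand-in for TestDuration, holding the question's sample rows.
DECLARE @Sample TABLE (
  VALIDATION_TIMESTAMP DATETIME,
  ID_TICKET            BIGINT,
  ID_PLACE             BIGINT
);

INSERT INTO @Sample (VALIDATION_TIMESTAMP, ID_TICKET, ID_PLACE)
VALUES ('2012-07-25T19:24:05.700', 1, 1),
       ('2012-07-25T20:08:04.250', 2, 2),
       ('2012-07-26T10:18:13.040', 3, 3),
       ('2012-07-26T10:18:20.990', 1, 2),
       ('2012-07-26T10:18:29.290', 2, 4),
       ('2012-07-26T10:25:37.040', 1, 4);

WITH CTE
     AS (SELECT ID_PLACE AS ID_PLACE_IN,
                -- Place of the same ticket's next validation, NULL if there is none.
                LEAD(ID_PLACE) OVER (PARTITION BY ID_TICKET
                                         ORDER BY VALIDATION_TIMESTAMP) AS ID_PLACE_OUT,
                -- Minutes until the same ticket's next validation.
                DATEDIFF(minute,
                         VALIDATION_TIMESTAMP,
                         LEAD(VALIDATION_TIMESTAMP) OVER (PARTITION BY ID_TICKET
                                                              ORDER BY VALIDATION_TIMESTAMP)) AS VISIT_DURATION
         FROM   @Sample)
SELECT ID_PLACE_IN,
       ID_PLACE_OUT,
       COUNT(*)            AS NUMBER_OF_VISITS,
       AVG(VISIT_DURATION) AS AVERAGE_VISIT_DURATION
FROM   CTE
WHERE  ID_PLACE_OUT IS NOT NULL
GROUP  BY ID_PLACE_IN,
          ID_PLACE_OUT
ORDER  BY ID_PLACE_IN,
          ID_PLACE_OUT;

-- Expected result:
-- ID_PLACE_IN  ID_PLACE_OUT  NUMBER_OF_VISITS  AVERAGE_VISIT_DURATION
-- 1            2             1                 894
-- 2            4             2                 428
```

Ticket 3 has a single validation and so contributes no pair. The average for the 2 to 4 pair is the integer average of 7 and 850 minutes, because AVG over an INT expression returns an INT.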
Answered 2012-12-18T15:55:24.513