I have a table (let's call it audit) that looks like this:

+--------------------------------------------------------------------------+
| id | recordId | status | mdate                   | type  | relatedId     |
+--------------------------------------------------------------------------+
| 1  | 3006     | A      | 2013-04-03 23:59:01.275 | type1 | 1             |
| 2  | 3025     | B      | 2013-04-04 00:00:02.134 | type1 | 1             |
| 3  | 4578     | A      | 2013-04-04 00:04:30.033 | type2 | 1             |
| 4  | 7940     | C      | 2013-04-04 00:04:32.683 | type1 | <NULL>        |
| 5  | 3006     | D      | 2013-04-04 00:04:32.683 | type1 | <NULL>        |
| 6  | 4822     | E      | 2013-04-04 00:04:32.683 | type2 | <NULL>        |
| 7  | 3006     | A      | 2013-04-04 00:06:54.033 | type1 | 2             |
| 8  | 3025     | C      | 2013-04-04 00:06:54.033 | type1 | 2             |

...and on for millions of rows. And another table we'll call related:

+-------------+
| id | source |
+-------------+
| 1  | src_X  |
| 2  | src_Y  |
| 3  | src_Z  |
| 4  | src_X  |
| 5  | src_X  |

...and on for hundreds of thousands of rows.

There are more columns than these on both tables but this is all we need to describe the problem. The column relatedId joins to the related table. recordId also joins to another table, and there will be multiple entries in audit with the same recordId.

I'm trying to create a query that will produce the following output:

+-----------------+
| source  | count |
+-----------------+
| src_X   | 1643  |
| src_Y   | 255   |
| NULL    | 729   |
+-----------------+

The count is the number of records within audit that have a given type (e.g. "type1") and are within a set of statuses (e.g. "A", "B", "C"), which are then left outer joined to related and grouped by source.

The catch is that I only want to include records from within audit that are within a certain date range, and I also only want to join from audit to related on the oldest entry within that range for each recordId. Further, I want to ignore any records that match the type and status criteria, but have an entry for the same recordId that is older than the range of dates.

So, to clarify from the above example data: let's say I want a type of type1 and the status values "A", "B", "C" with a date range of 2013-04-04 to 2013-04-05. Rows 2 and 4 would be included in the count. Row 3 is excluded as it has the incorrect type. Row 5 is excluded as the status is incorrect. Row 6 is excluded because both the status and the type are incorrect. Row 1 is excluded as it is outside the date range. Row 7 is also excluded, as there is another row (row 1) that matches the status and type criteria with the same recordId and is older than the start of the date range. Row 8 is excluded as both row 8 and row 2 have the same recordId and match the criteria, but we only count the older of the two records within the range.

In other words, I want to count only the first time an entry for a given recordId appears in the table and is within the target date range.

We've come up with the following:

WITH data (recordId, id) AS (
    SELECT a.recordId, MIN(a.id)
    FROM audit a
    WHERE a.status in ('A','B','C')
        AND type = 'type1'
    GROUP BY a.recordId
)
SELECT r.source, COUNT(*)
FROM data d
    JOIN audit a ON d.id = a.id
    LEFT JOIN related r ON a.relatedId = r.id
WHERE a.mdate >= '2013-04-04 00:00:00.000'
    and a.mdate < '2013-04-05 00:00:00.000' 
GROUP BY r.source

This will be run on MS SQL Server 2008, and currently relies on the fact that the audit table's ids are autogenerated. Since the ids are generated at the point the record is inserted, mdate is also the insert timestamp, and records are never updated once inserted, I think this is OK. The query appears to give the correct output on a limited set of test data, but I was hoping for a second opinion.

  • Does this query look ok?
  • Can its performance be improved?
1 Answer

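You could use the ROW_NUMBER() function to rank the records by recordId and mdate, then limit the results to the first occurrence for each recordId, provided it falls between the specified dates.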

WITH data  AS 
(   SELECT  a.relatedId, a.mdate, rn = ROW_NUMBER() OVER(PARTITION BY a.RecordId ORDER BY a.mdate)
    FROM    audit a
    WHERE   a.status in ('A','B','C')
    AND     type = 'type1'
)
SELECT  r.source, [Count] = COUNT(*)
FROM    data d
        LEFT JOIN related r 
            ON d.relatedId = r.id
WHERE   d.rn = 1
AND     d.mdate >= '2013-04-04 00:00:00.000'
AND     d.mdate < '2013-04-05 00:00:00.000' 
GROUP BY r.source;

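I'm not sure whether this will perform any better than your current solution, but it does remove the reliance on rows being inserted in chronological order. If chronological insertion is not a concern, you could change the ORDER BY inside the ROW_NUMBER() function to use the id, since sorting on the clustered key will be faster.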
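To check the ranking approach against the sample rows from the question, here is a minimal, self-contained sketch. The table variables @audit and @related are hypothetical stand-ins for the real tables, with guessed column types, and the extra a.id in the ORDER BY is an addition that is not in the query above: it only makes the ranking deterministic when two rows for the same recordId share an mdate.

```sql
-- Hypothetical table variables standing in for the real audit and related tables.
-- Column types are assumptions; the rows are the sample data from the question.
DECLARE @audit TABLE (
    id        int         NOT NULL PRIMARY KEY,
    recordId  int         NOT NULL,
    status    char(1)     NOT NULL,
    mdate     datetime    NOT NULL,
    type      varchar(10) NOT NULL,
    relatedId int         NULL
);
DECLARE @related TABLE (
    id     int         NOT NULL PRIMARY KEY,
    source varchar(10) NOT NULL
);

INSERT INTO @audit (id, recordId, status, mdate, type, relatedId) VALUES
    (1, 3006, 'A', '2013-04-03T23:59:01.275', 'type1', 1),
    (2, 3025, 'B', '2013-04-04T00:00:02.134', 'type1', 1),
    (3, 4578, 'A', '2013-04-04T00:04:30.033', 'type2', 1),
    (4, 7940, 'C', '2013-04-04T00:04:32.683', 'type1', NULL),
    (5, 3006, 'D', '2013-04-04T00:04:32.683', 'type1', NULL),
    (6, 4822, 'E', '2013-04-04T00:04:32.683', 'type2', NULL),
    (7, 3006, 'A', '2013-04-04T00:06:54.033', 'type1', 2),
    (8, 3025, 'C', '2013-04-04T00:06:54.033', 'type1', 2);

INSERT INTO @related (id, source) VALUES
    (1, 'src_X'), (2, 'src_Y'), (3, 'src_Z'), (4, 'src_X'), (5, 'src_X');

WITH data AS
(   -- Rank every matching row per recordId, oldest first; a.id breaks mdate ties.
    SELECT  a.relatedId, a.mdate,
            rn = ROW_NUMBER() OVER (PARTITION BY a.recordId ORDER BY a.mdate, a.id)
    FROM    @audit a
    WHERE   a.status IN ('A', 'B', 'C')
    AND     a.type = 'type1'
)
SELECT  r.source, [Count] = COUNT(*)
FROM    data d
        LEFT JOIN @related r
            ON d.relatedId = r.id
WHERE   d.rn = 1                                -- first matching row per recordId
AND     d.mdate >= '2013-04-04T00:00:00.000'    -- and only if it is inside the range
AND     d.mdate <  '2013-04-05T00:00:00.000'
GROUP BY r.source;
```

On this data the query should return two rows, src_X and NULL, each with a count of 1. These correspond to rows 2 and 4 of the sample, which is what the question expects. Row 7 drops out because row 1 takes rn = 1 for recordId 3006 and lies before the range.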

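Performance tuning from the outside is very difficult. To do more than guess, we would need to see the indexes on the tables involved and the execution plan of the query. You can then identify the bottlenecks and work out which indexes could improve performance.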

This SQL Fiddle shows that both queries (mine and yours) end up with the same result, but when you look at the IO statistics you can see that your query gets:

(2 row(s) affected)
Table 'Related'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Audit'. Scan count 2, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Using ROW_NUMBER() you get:

(2 row(s) affected)
Table 'Related'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Audit'. Scan count 1, logical reads 1, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

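The key factor is the reduction in logical reads. A quick look at the execution plans shows that the ROW_NUMBER() solution has one fewer branch and an estimated cost of 37% of the batch, against 63% for your solution, so on this small set of data it appears to be a performance improvement.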
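For reference, IO figures like those quoted above can be collected with the session-level SET STATISTICS options. The following is only a sketch and assumes the audit and related tables from the question exist in the current database. In SQL Server Management Studio the output appears on the Messages tab.

```sql
SET STATISTICS IO ON;    -- report scan counts and logical/physical reads per table
SET STATISTICS TIME ON;  -- report parse/compile and execution times

-- The query being measured; swap in the MIN(id) version to compare the two.
WITH data AS
(   SELECT  a.relatedId, a.mdate,
            rn = ROW_NUMBER() OVER (PARTITION BY a.recordId ORDER BY a.mdate)
    FROM    audit a
    WHERE   a.status IN ('A', 'B', 'C')
    AND     a.type = 'type1'
)
SELECT  r.source, [Count] = COUNT(*)
FROM    data d
        LEFT JOIN related r
            ON d.relatedId = r.id
WHERE   d.rn = 1
AND     d.mdate >= '2013-04-04T00:00:00.000'
AND     d.mdate <  '2013-04-05T00:00:00.000'
GROUP BY r.source;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

Run each candidate query this way against a realistic volume of data, since figures from a handful of rows say little about how either query scales.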

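However, there is only so much I can tell from such a small data sample. Some solutions do not scale well and, as I said, it will depend on the size and distribution of your data. My advice would be to try different solutions and find the bottlenecks by inspecting the IO statistics and execution plans.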

For example, looking at the execution plan, the CTE accounts for 50% of the cost of my query.

By adding this index:

CREATE INDEX IX_Audit_ALL ON Audit (recordId, MDate, RelatedID, status, type)

I was able to reduce this to 18% of the query cost.

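In reality, though, without knowing more I can't say for certain that this index will (a) help this query on your data, or (b) not cause other problems in your database by slowing down inserts and updates.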
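One way to get a rough feel for point (b) after the index has been in place for a while is to compare how often each index on the table is read with how often it has to be maintained. The sketch below uses the sys.dm_db_index_usage_stats view. It assumes the table lives in the dbo schema, that you have VIEW SERVER STATE permission, and that the instance has been up long enough for the counters to mean something, since they reset when SQL Server restarts.

```sql
-- Reads (seeks, scans, lookups) versus writes (updates) for each index on the table.
-- 'dbo.Audit' is an assumed name; adjust the schema and table to match your database.
SELECT  i.name AS index_name,
        s.user_seeks,
        s.user_scans,
        s.user_lookups,
        s.user_updates      -- times the index was maintained by INSERT/UPDATE/DELETE
FROM    sys.indexes i
        LEFT JOIN sys.dm_db_index_usage_stats s
            ON  s.object_id   = i.object_id
            AND s.index_id    = i.index_id
            AND s.database_id = DB_ID()
WHERE   i.object_id = OBJECT_ID('dbo.Audit')
ORDER BY i.index_id;
```

An index with a high user_updates count and few or no seeks and scans is costing writes without helping reads. A row of NULLs means the index has not been touched since the counters were last reset.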

Answered 2013-06-26T14:01:01.170