I have a table (let's call it audit) that looks like this:
+----+----------+--------+-------------------------+-------+-----------+
| id | recordId | status | mdate                   | type  | relatedId |
+----+----------+--------+-------------------------+-------+-----------+
| 1  | 3006     | A      | 2013-04-03 23:59:01.275 | type1 | 1         |
| 2  | 3025     | B      | 2013-04-04 00:00:02.134 | type1 | 1         |
| 3  | 4578     | A      | 2013-04-04 00:04:30.033 | type2 | 1         |
| 4  | 7940     | C      | 2013-04-04 00:04:32.683 | type1 | <NULL>    |
| 5  | 3006     | D      | 2013-04-04 00:04:32.683 | type1 | <NULL>    |
| 6  | 4822     | E      | 2013-04-04 00:04:32.683 | type2 | <NULL>    |
| 7  | 3006     | A      | 2013-04-04 00:06:54.033 | type1 | 2         |
| 8  | 3025     | C      | 2013-04-04 00:06:54.033 | type1 | 2         |
+----+----------+--------+-------------------------+-------+-----------+
...and on for millions of rows. And another table we'll call related:
+----+--------+
| id | source |
+----+--------+
| 1  | src_X  |
| 2  | src_Y  |
| 3  | src_Z  |
| 4  | src_X  |
| 5  | src_X  |
+----+--------+
...and on for hundreds of thousands of rows.
There are more columns than these on both tables, but this is all we need to describe the problem. The column relatedId joins to the related table. recordId also joins to another table, and there will be multiple entries in audit with the same recordId.
I'm trying to create a query that will produce the following output:
+--------+-------+
| source | count |
+--------+-------+
| src_X  | 1643  |
| src_Y  | 255   |
| NULL   | 729   |
+--------+-------+
The count is the number of records within audit that have a given type (e.g. "type1") and are within a set of statuses (e.g. "A", "B", "C"), which are then left outer joined to related and grouped by source.
The catch is that I only want to include records from audit that fall within a certain date range, and I only want to join from audit to related on the oldest entry within that range for each recordId. Further, I want to ignore any records that match the type and status criteria but have another matching entry for the same recordId that is older than the start of the date range.
So, to clarify from the above example data: let's say I want a type of type1 and the status values "A", "B", "C", with a date range of 2013-04-04 to 2013-04-05. Rows 2 and 4 would be included in the count. Row 3 is excluded as it has the incorrect type. Row 5 is excluded as the status is incorrect. Row 6 is excluded because both the status and the type are incorrect. Row 1 is excluded as it is outside the date range. Row 7 is also excluded, as there is another row (row 1) with the same recordId that matches the status and type criteria and is older than the start of the date range. Row 8 is excluded as both row 8 and row 2 have the same recordId and match the criteria, but we only count the oldest record of the two within the range.
In other words, I want to count only the first time an entry for a given recordId appears in the table and is within the target date range.
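In case it helps anyone reproduce this, here is a minimal sketch of the sample data as a scratch schema. The column types, the primary keys, and the foreign key are assumptions for illustration only; the real tables have more columns and the audit id is autogenerated rather than inserted by hand.

```sql
-- Scratch copy of the two tables, limited to the columns shown above.
-- All data types and constraints here are assumed, not taken from the real schema.
CREATE TABLE related (
    id     INT         NOT NULL PRIMARY KEY,
    source VARCHAR(20) NOT NULL
);

CREATE TABLE audit (
    id        INT          NOT NULL PRIMARY KEY,  -- autogenerated in the real table
    recordId  INT          NOT NULL,
    status    CHAR(1)      NOT NULL,
    mdate     DATETIME2(3) NOT NULL,              -- insert timestamp
    type      VARCHAR(20)  NOT NULL,
    relatedId INT          NULL REFERENCES related (id)
);

INSERT INTO related (id, source) VALUES
    (1, 'src_X'),
    (2, 'src_Y'),
    (3, 'src_Z'),
    (4, 'src_X'),
    (5, 'src_X');

INSERT INTO audit (id, recordId, status, mdate, type, relatedId) VALUES
    (1, 3006, 'A', '2013-04-03 23:59:01.275', 'type1', 1),
    (2, 3025, 'B', '2013-04-04 00:00:02.134', 'type1', 1),
    (3, 4578, 'A', '2013-04-04 00:04:30.033', 'type2', 1),
    (4, 7940, 'C', '2013-04-04 00:04:32.683', 'type1', NULL),
    (5, 3006, 'D', '2013-04-04 00:04:32.683', 'type1', NULL),
    (6, 4822, 'E', '2013-04-04 00:04:32.683', 'type2', NULL),
    (7, 3006, 'A', '2013-04-04 00:06:54.033', 'type1', 2),
    (8, 3025, 'C', '2013-04-04 00:06:54.033', 'type1', 2);
```

With only these eight rows, the example described above (type1, statuses "A", "B", "C", range 2013-04-04 to 2013-04-05) should count rows 2 and 4, which gives src_X = 1 and NULL = 1.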
We've come up with the following:
-- Earliest matching audit row per recordId, using the lowest id as "oldest"
WITH data (recordId, id) AS (
    SELECT a.recordId, MIN(a.id)
    FROM audit a
    WHERE a.status IN ('A', 'B', 'C')
      AND a.type = 'type1'
    GROUP BY a.recordId
)
-- Keep only those first rows that fall inside the date range, then count by source
SELECT r.source, COUNT(*)
FROM data d
JOIN audit a ON d.id = a.id
LEFT JOIN related r ON a.relatedId = r.id
WHERE a.mdate >= '2013-04-04 00:00:00.000'
  AND a.mdate < '2013-04-05 00:00:00.000'
GROUP BY r.source
This will be run on SQL Server 2008, and currently relies on the audit table's ids being autogenerated, so that a lower id means an earlier insert. Since the ids are generated at the point the record is inserted, mdate is also the insert timestamp, and the records are never updated once inserted, I think this is OK. The query appears to give the correct output on a limited set of test data, but I was hoping for a second opinion.
- Does this query look ok?
- Can its performance be improved?
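
For comparison, here is a rough sketch of a variant that picks the oldest matching row by mdate (with id only as a tie-breaker) instead of depending on id order alone. It is an assumption about how this could be written, not something we have run against the full data, and the names ranked, rn, and k are just illustrative.

```sql
-- Rank the matching rows for each recordId from oldest to newest.
WITH ranked AS (
    SELECT a.relatedId,
           a.mdate,
           ROW_NUMBER() OVER (PARTITION BY a.recordId
                              ORDER BY a.mdate, a.id) AS rn
    FROM audit a
    WHERE a.status IN ('A', 'B', 'C')
      AND a.type = 'type1'
)
-- Count a recordId only if its oldest matching row lies inside the date range.
SELECT r.source, COUNT(*) AS [count]
FROM ranked k
LEFT JOIN related r ON k.relatedId = r.id
WHERE k.rn = 1
  AND k.mdate >= '2013-04-04 00:00:00.000'
  AND k.mdate <  '2013-04-05 00:00:00.000'
GROUP BY r.source;
```

On the eight sample rows this picks rows 1, 2, and 4 as the oldest matches per recordId and then drops row 1 for being before the range, so it agrees with the expected result for the example. I have no idea yet how it compares for performance.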