sql - How can I improve this query - is partitioning the best option here?

Question

I have a table named Transactions that currently holds 6+ million rows (growing by roughly 600-700 thousand per month). It looks like this:

pk                                                           id          acct_id     id1         id2         id3         id4         created                 interface_id source_lvl1 source_lvl2 trans_type
------------------------------------------------------------ ----------- ----------- ----------- ----------- ----------- ----------- ----------------------- ------------ ----------- ----------- -----------
10000257.4297...400245990.3.1002                             10000257    4297        NULL        NULL        NULL        NULL        2012-09-06 11:26:30.000 1            32002       1002        3
10004819.1529.106.105442.400667675.6.1021                    10004819    1529        106         105442      62          NULL        2012-09-11 08:34:35.000 4            32002       1021        6
10004819.1529.18664647.62.400667675.3.1021                   10004819    1529        18664647    62          NULL        NULL        2012-09-11 08:34:35.000 4            32002       1021        3
10006460.1529.106.105442.400667675.6.1021                    10006460    1529        106         105442      62          NULL        2012-09-11 08:34:35.000 4            32002       1021        6
10006460.1529.18664647.62.400667675.3.1021                   10006460    1529        18664647    62          NULL        NULL        2012-09-11 08:34:35.000 4            32002       1021        3
10006648.3280...406204785.3.1002                             10006648    3280        NULL        NULL        NULL        NULL        2012-11-14 10:39:45.000 6            32002       1002        3
10006834.1529.106.105442.400667675.6.1021                    10006834    1529        106         105442      62          NULL        2012-09-11 08:34:35.000 4            32002       1021        6
10006834.1529.18664647.62.400667675.3.1021                   10006834    1529        18664647    62          NULL        NULL        2012-09-11 08:34:35.000 4            32002       1021        3
10006962.2428...415795811.3.1018                             10006962    2428        NULL        NULL        NULL        NULL        2013-03-05 10:50:11.000 1            32002       1018        3
10006962.2428.107972..415795811.4.1018                       10006962    2428        107972      NULL        NULL        NULL        2013-03-05 10:50:11.000 1            32002       1018        4

I have defined a view that is meant to help count specific events.

Here is the SQL definition:

CREATE VIEW [dbo].[Queue_base]

AS

select 
dateadd(minute , (DATEPART(minute,t.created)/30)*30 , DATEADD(hour,datediff(hour, 0, t.created), 0)) INTRVL_UTC,
dateadd(minute , (DATEPART(minute,t.created)/30)*30 + 30 , DATEADD(hour,datediff(hour, 0, t.created), 0)) INTRVL_END_UTC,
a.ID [Agent ID], a.Login, a.DisplayName, a.GroupName, q.QueueID, q.QueueName, 
    TODATETIMEOFFSET(t.created,0) created   
,i.ReferenceNumber, t.id inc_id
, case when (t.trans_type=17 and t.source_lvl2 not IN (1001, 2001)) or (t.trans_type=6 and t.id1=8) then t.id else null end [Workload]
, case when (t.trans_type=6 and t.id1=8 and t.source_lvl2 not IN (1001, 2001) or (t.trans_type=17 and not t.source_lvl2 IN (1001,2001)))then t.id else null end [Inbound Emails]
, case when t.trans_type=17 and t.id1=q.QueueID then t.id else null end [EnQueued]
, case when t.trans_type=17 and t.id2=q.QueueID then t.id else null end [DeQueued]
, case when t.trans_type=6 and t.id1 IN (2,106) then t.id else null end [Solved]
, case when t.trans_type=6 and t.id1 =8 then t.id else null end [Updated]
, case when x.StatusTypeID = 2 then t.id else null end [Reopened]
, case when t.trans_type=6 and t.id1=125 then t.id else null end [Spam]
, case when t.trans_type=8 and t.acct_id <> 1 then t.id else null end [Responded]
, case when i.cr_rec_element_1 is not null or i.de_reason1 is not null then t.id else null end [Complaint]
,t.trans_type, t.id1
,r.Brand, r.Region, r.[Call Center], r.LOB, r.[LOB Detail], r.Team, r.Subteam, r.Channel
,r.Interface, r.Product, r.[Product Detail], r.Unit
from Transactions t 
left join
(
select a.*, b.id1, st.StatusTypeID
from
(select  
t1.pk, t1.id, t1.created,   max(t2.created) maxdate
from Transactions t1 
    left join Transactions t2 
    on t1.id=t2.id and t2.created<t1.created and t2.trans_type=6
 left join Status st on t2.id1=st.StatusID
 where t1.trans_type=6 and t1.id1=8
group by t1.pk, t1.id, t1.created) a left join Transactions b on a.id=b.id and b.created=a.maxdate and b.trans_type=6
left join Status st on b.id1=st.statusid
)
x on t.pk=x.pk
left join Incident i on t.id=i.id
left join Account a on t.acct_id=a.ID
left join Queue q ON  (t.trans_type=17 and (t.id1=q.QueueID or t.id2=q.QueueID) or t.trans_type IN (6,8) and t.id3=q.QueueID) 
left join queuedim r ON (q.QueueName=r.QueueName or q.QueueName is null and r.QueueName is null) 
    and (q.QueueID=r.QueueID or q.QueueID is null and r.QueueID is null)
where t.trans_type=17 or t.trans_type IN (6,8)

This is the key part of the view's output:

inc_id      Workload    Inbound Emails EnQueued    DeQueued    Solved      Updated     Reopened    Spam        Responded   Complaint
----------- ----------- -------------- ----------- ----------- ----------- ----------- ----------- ----------- ----------- -----------
10209648    NULL        NULL           NULL        NULL        10209648    NULL        NULL        NULL        NULL        NULL
10209648    NULL        NULL           NULL        NULL        NULL        NULL        NULL        NULL        10209648    NULL
10209648    10209648    NULL           NULL        NULL        NULL        10209648    NULL        NULL        NULL        NULL
10227966    NULL        NULL           NULL        NULL        NULL        NULL        NULL        10227966    NULL        NULL
10288343    NULL        NULL           NULL        NULL        10288343    NULL        NULL        NULL        NULL        NULL
10303898    NULL        NULL           NULL        NULL        10303898    NULL        NULL        NULL        NULL        NULL
10394204    NULL        NULL           NULL        NULL        NULL        NULL        NULL        10394204    NULL        NULL
10409624    NULL        NULL           NULL        NULL        10409624    NULL        NULL        NULL        NULL        NULL
10482071    NULL        NULL           NULL        NULL        NULL        NULL        NULL        10482071    NULL        NULL
10485993    NULL        NULL           NULL        NULL        NULL        NULL        NULL        10485993    NULL        NULL

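My plan is to create another table and continuously update it with the aggregated results I am interested in, grouped by date period and combinations of other dimensions. The problem is that I need both distinct and plain counts of the events described above. While the view returns the raw rows quickly, the query that computes the counts takes a very long time: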

    --  month   account
declare @d1 date
declare @d2 date

set @d1 = '2013-05-01'
set @d2 = '2013-06-01'
--insert into IncPerfQueue
select x.Brand, x.Region, x.[Call Center], x.LOB, x.[LOB Detail], x.Team, x.Subteam,
x.QueueName, case when x.[Agent ID] is null then 0 else [Agent ID] end,  c.[month], NULL weekstart, NULL [date]

, count(distinct EnQueued) [Distinct Incidents EnQueued]
, count(distinct DeQueued) [Distinct Incidents DeQueued]
, count(distinct Solved) [Distinct Incidents Solved in the queue]
, COUNT(distinct Responded) [Distinct Incidents Responded in the queue]
, COUNT(distinct Updated)   [Distinct Incidents Updated in the queue]
, count(distinct Reopened) [Distinct Incidents ReOpened in the queue]
, count(distinct Spam) [Distinct Spam closed in the queue]
, COUNT([Inbound Emails]) [Inbound Emails]
, COUNT(Workload) [Workload]
, count(EnQueued) [# EnQueued]
, count(DeQueued) [# DeQueued]
, count(Solved) [# Solved in the queue]
, COUNT(Responded) [# Responded in the queue]
, COUNT(Updated) [#Updated in the queue]
, count(Reopened) [# ReOpened in the queue]
, count(Spam) [# Spam closed in the queue]

from Queue_base x
join [calendar] c ON convert(date,x.created)=c.date
where x.created >= @d1 and x.created < @d2
and Brand is not null
group by x.Brand, x.Region, x.[Call Center], x.LOB, x.[LOB Detail], x.Team, x.Subteam,
x.QueueName, [Agent ID], c.month

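This is only one of the queries required, since separate aggregations are needed for different dimensions (the distinct counts differ per grouping), and it took more than an hour! http://i.stack.imgur.com/oWibJ.png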

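I would appreciate advice on the best approach for this kind of query. The base table will certainly grow much larger soon... should I partition it? I should also note that all the tables referenced here are indexed, and that I am using Microsoft SQL Server 2008 R2 (SP2) (X64) installed on a machine with 2 X5550 processors and 48GB RAM; the operating system is Windows Server 2008 R2 Enterprise.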

Thanks, 马朱

score 0 · Accepted Answer

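Looking at your query, I'd guess most of the cost is in the left joins and the GROUP BY. You probably can't do much about the left joins without sacrificing correctness, but I would look at your query plan to see how much those group-bys cost.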
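As a rough sketch of how to get at that plan without waiting an hour: SET SHOWPLAN_XML makes SQL Server return the estimated plan instead of executing the statement. The query below is a cut-down stand-in for the real aggregation (one grouping column, one distinct count), and the `dbo` schema is an assumption.

```sql
-- SET SHOWPLAN_XML must be the only statement in its batch, hence the GO separators.
SET SHOWPLAN_XML ON;
GO

-- Not executed while SHOWPLAN_XML is ON; only the estimated plan is returned.
-- Simplified stand-in for the real aggregation query.
SELECT x.Brand,
       COUNT(DISTINCT x.Solved) AS [Distinct Incidents Solved in the queue]
FROM dbo.Queue_base AS x
WHERE x.created >= '2013-05-01'
  AND x.created <  '2013-06-01'
  AND x.Brand IS NOT NULL
GROUP BY x.Brand;
GO

SET SHOWPLAN_XML OFF;
GO
```

In Management Studio the returned XML opens as a graphical plan, where the relative cost of each Sort and join operator is shown. Estimated costs can differ from what actually happens at run time, so treat them as a first hint.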

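Since you have millions of rows, I'd guess the sorts in your query plan take more than 90% of the time. Adding indexes on those group-by columns could really help turn the sorts into index scans. Scanning an index will certainly be faster than sorting, which effectively re-creates those indexes on every run. If you can post your query plan, that would be very helpful. It might also help to have a smaller version of your data (perhaps a few thousand rows per table), so you can experiment with indexes without waiting a long time for them to be created.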
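A minimal sketch of that experiment, under several assumptions: the tables live in the `dbo` schema, `Transactions_sample` and `IX_queuedim_grouping` are hypothetical names, and the queuedim columns are narrow enough to fit the 900-byte index key limit of SQL Server 2008 R2. Whether the optimizer actually uses such an index to avoid a sort has to be verified against the plan.

```sql
-- Hypothetical scratch table: copy a few thousand recent rows to experiment on.
SELECT TOP (5000) *
INTO dbo.Transactions_sample
FROM dbo.Transactions
ORDER BY created DESC;

-- Candidate index on the group-by columns that come from queuedim.
-- Assumes the combined key width stays under the 900-byte limit.
CREATE NONCLUSTERED INDEX IX_queuedim_grouping
    ON dbo.queuedim (Brand, Region, [Call Center], LOB, [LOB Detail], Team, Subteam);

-- Remove the experiment once the plans have been compared.
-- DROP INDEX IX_queuedim_grouping ON dbo.queuedim;
-- DROP TABLE dbo.Transactions_sample;
```

Note that a sample taken by date can cut off earlier rows for the same incident, so counts that depend on history, such as Reopened, may differ from the full table. The sample is meant for comparing plans, not results.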

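I don't think partitioning is necessary to reduce the query time. Partitioning only helps at the IO level. Since you're talking about a few million rows, I'd guess all the relevant columns of all the relevant tables add up to only a few gigabytes. Loading that from disk takes minutes, not an hour. When SQL Server spends an hour on a query, it will certainly keep most of that data in memory. So partitioning won't help much here. Once your query plan no longer has hot spots and you find that IO is the bottleneck in the query, then I would consider partitioning.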
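The size guess above is easy to check. The following sketch uses the system procedure sp_spaceused, assuming the tables are in the `dbo` schema:

```sql
-- Reports row count plus reserved, data, and index sizes for one table.
EXEC sp_spaceused @objname = N'dbo.Transactions';

-- Repeat for the other tables the view touches.
EXEC sp_spaceused @objname = N'dbo.Incident';
EXEC sp_spaceused @objname = N'dbo.queuedim';
```

If the `data` and `index_size` figures together are well below the 48GB of RAM on the server, the working set can stay in memory and disk IO is unlikely to be what costs an hour.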

You can check IO using SET STATISTICS IO ( http://msdn.microsoft.com/en-us/library/ms184361.aspx ).
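For example, a session could be instrumented roughly like this. The query in the middle is a simplified stand-in for the real aggregation, and the `dbo` schema is an assumption:

```sql
-- Per-table read counts and CPU/elapsed times go to the Messages tab.
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

-- Simplified stand-in for the real aggregation query.
SELECT x.Brand, COUNT(x.Solved) AS [# Solved in the queue]
FROM dbo.Queue_base AS x
WHERE x.created >= '2013-05-01'
  AND x.created <  '2013-06-01'
  AND x.Brand IS NOT NULL
GROUP BY x.Brand;

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
```

High logical reads with few physical reads would suggest the data is already served from memory, which is the situation where partitioning is unlikely to help.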
