sql - sql server parent-child join and slow query performance

Question

I have 2 tables (cannot change them)

Parent (id, date, amount)
Child (parent_id, key, value)

indexes

Parent.pk (id)
Parent.idx1 (id, date) include (amount)
Child.pk (parent_id, key)
Child.idx1 (parent_id, key, value)

and query

select sum(amount)
from Parent as p
left outer join Child as c1 on c1.parent_id = p.id and c1.key = 'X'
left outer join Child as c2 on c2.parent_id = p.id and c2.key = 'Y'
where p.date between '20120101' and '20120131'
and c1.value = 'x1'
and c2.value = 'y1'

Problem is performance.
Parent has ~1 500 000 records and Child ~6 000 000 records

Take 1

This query takes ~3 s, which is too much for my scenario; it must take less than a few milliseconds.

The execution plan shows that SQL Server does an index scan on Parent.idx1 and then a merge join with a Child.idx1 clustered index seek, which is not optimal because it scans all 1,500,000 records even though I filter them by date.

Take 2

When I change Parent.idx1 to

Parent.idx1 (date, id) include (amount)

SQL Server chooses a clustered index scan on Parent.pk and then again a merge join with Child.idx1. Execution time is ~6 s.

Take 3

When I force it to use Parent.idx1 (date, id) include (amount), it sorts the result before the merge join, and execution time is even worse: ~11 s.
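
For reference, Takes 2 and 3 could be reproduced roughly as follows. This is only a sketch: the index name idx1 comes from the index list above, [key] is bracketed because KEY is a reserved word in T-SQL, and the exact statements used in the original experiment are not shown in the question.

```sql
-- Take 2: date-leading index, so a date range can be located with a seek.
CREATE INDEX idx1 ON Parent (date, id) INCLUDE (amount);

-- Take 3: a table hint that forces the optimizer to use that index.
SELECT SUM(p.amount)
FROM Parent AS p WITH (INDEX(idx1))
LEFT OUTER JOIN Child AS c1 ON c1.parent_id = p.id AND c1.[key] = 'X'
LEFT OUTER JOIN Child AS c2 ON c2.parent_id = p.id AND c2.[key] = 'Y'
WHERE p.date BETWEEN '20120101' AND '20120131'
  AND c1.value = 'x1'
  AND c2.value = 'y1';
```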

Take 4

I tried to create an indexed view but cannot use it because of the LEFT OUTER JOIN.

Is there any way to make such a query (a Parent-Child join with filters on both tables) faster?
Without de-normalization.

Update 2013-07-04:
To those answering "use INNER JOIN": yes, it's much faster, but I cannot use it.
What I showed here is a simplified version of what I really need.
I need to create a SQL view over the MS Dynamics NAV "G/L Entry" (Parent) and "Ledger Entry Dimension" (Child) tables so that I can read it from that application. The complete view looks like this right now:

create view analysis
as
select 
    v.id as view_id
    , p.date
    , p.Amount
    , c1.value as value1
    , c2.value as value2
    , c3.value as value3
    , c4.value as value4
from Parent as p
    cross join analysis_view as v
    left outer join Child as c1 on c1.parent_id = p.id and c1.key = v.key1
    left outer join Child as c2 on c2.parent_id = p.id and c2.key = v.key2
    left outer join Child as c3 on c3.parent_id = p.id and c3.key = v.key3
    left outer join Child as c4 on c4.parent_id = p.id and c4.key = v.key4

where analysis_view currently contains 8 records and looks like this: analysis_view (id, key1, key2, key3, key4)
and then the application may query it like this

select sum(amount)
from analysis
where view_id = 1 and date between '20120101' and '20120131'
and value1 = 'x1'
and value2 = 'x2'

or

select sum(amount)
from analysis
where view_id = 1 and date between '20120101' and '20120131'
and value1 = 'x1'
and value3 = 'z1'

MS Dynamics NAV already has a de-normalized table for this and queries against it are fast, but it's huge in our case (~10 GB) and it locks the whole system for around one hour when someone creates a new analysis view. Also, NAV doesn't know how to produce joins, which is why I must define the view on the SQL Server side.

score 1 · Accepted Answer

Change your LEFT JOINs to INNER JOINs. The predicate c1.value = 'x1' will discard the outer rows anyway.
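
A minimal sketch of that rewrite, assuming the table and column names from the question. [key] is bracketed because KEY is a reserved word in T-SQL; moving the value filters into the ON clauses is a stylistic choice, not something the question requires.

```sql
-- The WHERE filters on c1.value and c2.value already reject the NULL-extended
-- rows a LEFT JOIN would produce, so inner joins return the same sum.
SELECT SUM(p.amount)
FROM Parent AS p
INNER JOIN Child AS c1
    ON c1.parent_id = p.id AND c1.[key] = 'X' AND c1.value = 'x1'
INNER JOIN Child AS c2
    ON c2.parent_id = p.id AND c2.[key] = 'Y' AND c2.value = 'y1'
WHERE p.date BETWEEN '20120101' AND '20120131';
```

Because Child.pk is (parent_id, key), each join matches at most one child row per parent, so no amount is counted twice.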

score 1 · Accepted Answer

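I've been playing with a few attempts, but I haven't found anything quicker than just fixing up the indexes.

Take 1: Create a materialized view that handles the parent and the first child (you cannot have two references to the same table in one materialized view), then join it to the child in the query. Not much faster.

Take 2: Create a second materialized view, again with parent and child2, and use a join between the two materialized views. Not much faster.

Take 3: Use INTERSECT instead of JOIN to combine the two materialized views. Not much faster.

Take 4: Break the datetime in the materialized views into year and month columns. Not much faster (actually slower).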

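The biggest problem seems to be that you constrain the child table twice, which removes the ability to do the materialized view in any efficient way. I could write a materialized view that pre-aggregates monthly totals for parents with child key 'X' and value 'X1', but it would not hold enough information to join back and filter out the amounts of parents that have no child2 relationship.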
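
The pre-aggregating view described above might look like the following. This is a sketch only, not something from the test run: the names parent_x_monthly, total_amount, and row_count are hypothetical, and it assumes amount is NOT NULL (as in the test setup), since an indexed view cannot SUM a nullable expression.

```sql
CREATE VIEW dbo.parent_x_monthly WITH SCHEMABINDING AS
SELECT DATEPART(YEAR, p.date)  AS yy,
       DATEPART(MONTH, p.date) AS mm,
       SUM(p.amount)           AS total_amount,
       COUNT_BIG(*)            AS row_count  -- required in an indexed view with GROUP BY
FROM dbo.Parent AS p
JOIN dbo.Child AS c1
    ON c1.parent_id = p.id AND c1.[key] = 'X' AND c1.value = 'x1'
GROUP BY DATEPART(YEAR, p.date), DATEPART(MONTH, p.date);
GO
-- Materialize the view: one row per (year, month).
CREATE UNIQUE CLUSTERED INDEX parent_x_monthly_pk ON dbo.parent_x_monthly (yy, mm);
```

Each row is a monthly total with no parent id left in it, so there is nothing to join back to Child on in order to apply the second ('Y') filter.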

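I then got lazy and tried performance testing with 1/10 of the amount of data you have, and my results were still very quick (<200 ms) no matter what I did. I'm now building a full set of test data, but obviously I don't know what your distribution is. It would help to know how many of those 1,500,000 records have X children, Y children, and both X and Y children, and whether this is a fixed query or the keys/values will change at runtime.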
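
One way to collect those distribution numbers, as a rough sketch assuming the Child table and the literal keys 'X' and 'Y' from the question (the alias and output column names are made up for illustration):

```sql
SELECT COUNT(*)           AS parents_with_children,
       SUM(has_x)         AS parents_with_x,
       SUM(has_y)         AS parents_with_y,
       SUM(has_x * has_y) AS parents_with_x_and_y
FROM (
    -- One row per parent, flagging whether it has an 'X' child and a 'Y' child.
    SELECT parent_id,
           MAX(CASE WHEN [key] = 'X' THEN 1 ELSE 0 END) AS has_x,
           MAX(CASE WHEN [key] = 'Y' THEN 1 ELSE 0 END) AS has_y
    FROM Child
    GROUP BY parent_id
) AS per_parent;
```

Parents with no child rows at all are not counted here; comparing parents_with_children with SELECT COUNT(*) FROM Parent gives that number.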

Here's my test script. Setup:

CREATE TABLE Parent (id int NOT NULL CONSTRAINT parent_pk PRIMARY KEY, date datetime, amount decimal(10,2) NOT NULL)
CREATE TABLE Child (parent_id int NOT NULL, [key] char(1) NOT NULL, value char(2) NOT NULL, CONSTRAINT child_pk PRIMARY KEY (parent_id,[key]))
CREATE INDEX Parent_IDX ON Parent (id,date,amount)
CREATE INDEX Child_IDX ON Child (parent_id,[key],value)

DECLARE @RowCount INT
DECLARE @Random INT
DECLARE @Upper INT
DECLARE @Lower INT
DECLARE @InsertDate DATETIME
DECLARE @keys INT
DECLARE @key INT

SET @Lower = 0
SET @Upper = 500
SET @RowCount = 0
WHILE @RowCount < 15000
BEGIN

SELECT @Random = ROUND(((@Upper - @Lower -1) * RAND() + @Lower), 0)
SET @InsertDate = DATEADD(dd, @Random, GETDATE())

INSERT INTO Parent(id,date,amount) 
VALUES (@RowCount , @InsertDate ,@Random)

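-- ROUND(RAND()*3+1,0) yields 1 to 4; when it is 4, the last pass inserts a blank key because 'XYZ' has only three characters
SET @keys=ROUND(RAND()*3+1,0)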
SET @key=0
WHILE @key<@keys
BEGIN
INSERT INTO Child(parent_id,[key],value)
VALUES (@RowCount,SUBSTRING('XYZ',@key+1,1),SUBSTRING('XYZ',@key+1,1)+'1')
SET @key=@key+1
END

SET @RowCount = @RowCount + 1
END

And my scratchpad:

SELECT COUNT(*) ParentCount FROM Parent
GO
SELECT COUNT(*) ChildCount FROM Child
GO
CREATE INDEX Parent_IDX2 ON Parent(date,id)
GO
CREATE VIEW blah WITH SCHEMABINDING AS
SELECT p.id,p.amount,DATEPART(YEAR,p.date) AS yy,DATEPART(Month,p.date) AS mm
from dbo.Parent as p
join dbo.Child as c1 on c1.parent_id = p.id and c1.[key] = 'X' and c1.value = 'x1'
--join dbo.Child as c2 on c2.parent_id = p.id and c2.[key] = 'Y' and c2.value = 'y1'
GO
CREATE UNIQUE CLUSTERED INDEX blah_pk ON blah (id)
CREATE INDEX blah_IDX ON blah (yy,mm,amount)
GO
CREATE VIEW blah2 WITH SCHEMABINDING AS
SELECT p.id,p.amount,DATEPART(YEAR,p.date) AS yy,DATEPART(Month,p.date) AS mm
from dbo.Parent as p
join dbo.Child as c1 on c1.parent_id = p.id and c1.[key] = 'Y' and c1.value = 'y1'

GO
CREATE UNIQUE CLUSTERED INDEX blah2_pk ON blah2 (id)
CREATE INDEX blah2_IDX ON blah2 (yy,mm,amount)
GO
select sum(amount)
from Parent as p
join Child as c1 on c1.parent_id = p.id and c1.[key] = 'X' and c1.value = 'x1'
join Child as c2 on c2.parent_id = p.id and c2.[key] = 'Y' and c2.value = 'y1'
where p.date between '20130801' and '20130831'
GO
select sum(amount)
from blah p
join Child as c2 on c2.parent_id = p.id and c2.[key] = 'Y' and c2.value = 'y1'
where p.yy=2013 and p.mm=8
GO
SELECT sum(blah.amount)
FROM blah
JOIN blah2 ON blah.id=blah2.id AND blah.yy=blah2.yy AND blah.mm=blah2.mm and blah.amount=blah2.amount
where blah.yy=2013 and blah.mm=8

SELECT SUM(amount)
FROM (
SELECT *
FROM blah
where blah.yy=2013 and blah.mm=8
INTERSECT
SELECT *
FROM blah2
where blah2.yy=2013 and blah2.mm=8
) t1

score 0 · Accepted Answer

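A few things affect performance here (though I'm not an expert). One of them is having an index on Child with every column of that table as a key column of the index, which doesn't really make sense. Another is that you are filtering the query on values from c1 and c2, which turns the joins into INNER JOINs. You should try modifying it to use EXISTS instead, like this: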

select sum(amount)
from Parent as p
where p.date between '20120101' and '20120131'
and exists(select 1 from Child 
           where parent_id = p.id and [key] = 'X'
           and value = 'x1')
and exists(select 1 from Child 
           where parent_id = p.id and [key] = 'Y'
           and value = 'y1')
