SQL Server 2012 appears to introduce CUME_DIST() and PERCENT_RANK() for computing the cumulative distribution of a column. Is there an equivalent way to do this in SQL Server 2008?
Viewed 2,699 times
4 answers
3
Never say never, in SQL.
The statement:
select percent_rank() over (partition by <x> order by <y>)
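is essentially equivalent to:
select 1.0 * row_number() over (partition by <x> order by <y>) / count(*) over (partition by <x>)
"Essentially" means this is an approximation: it behaves well when the data has no duplicates, and it should still be close enough when it does. (The 1.0 * avoids integer division, which would otherwise return only 0 or 1.)
The "real" answer is that it is equivalent to:
select 1.0 * row_number() over (partition by <x> order by <y>) / count(distinct <y>) over (partition by <x>)
However, count(distinct) is not available as a window function. And unless you really need it, expressing it in 2008 is a pain.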
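One way to stand in for a windowed count(distinct) on SQL Server 2008 is to take the highest DENSE_RANK() in each partition. The following is only a sketch under stated assumptions: the table variable @scores and the columns grp and val are hypothetical names, not part of the original answer. Note also that DENSE_RANK() treats NULL as a value, whereas count(distinct) ignores NULLs.

```sql
-- Hypothetical sample data; @scores, grp and val are illustrative names.
DECLARE @scores TABLE (grp INT, val INT);
INSERT INTO @scores VALUES (1, 10), (1, 10), (1, 20), (2, 5), (2, 7), (2, 7);

WITH ranked AS (
    SELECT grp, val,
           dr = DENSE_RANK() OVER (PARTITION BY grp ORDER BY val)
    FROM @scores
)
SELECT grp, val,
       -- The highest dense rank in a partition equals its number of distinct values.
       distinct_vals = MAX(dr) OVER (PARTITION BY grp)
FROM ranked;
```

With this sample data each group has two distinct values, so distinct_vals is 2 on every row.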
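The cume_dist() function is harder, because it needs a cumulative sum, and for that you need a self-join. An approximation, assuming no duplicates:
with t as (select <x>, <y>,
                  row_number() over (partition by <x> order by <y>) as seqnum
           from <table>
          )
select s.*, s.sumy*1.0 / sum(s.<y>) over (partition by s.<x>) -- running sum as a share of the partition total
from (select t.<x>, t.<y>, t.seqnum, sum(tprev.<y>) as sumy -- running sum of <y> up to this row
      from t left outer join
           t tprev
           on t.<x> = tprev.<x> and t.seqnum >= tprev.seqnum
      group by t.<x>, t.<y>, t.seqnum
     ) s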
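If the goal is the row-count definition of CUME_DIST() (rows with a value less than or equal to the current one, divided by all rows in the partition), a self-join may not be needed on 2008. The sketch below combines RANK() with a count of tied rows; @scores, grp and val are hypothetical names introduced only for illustration, and NULL handling has not been considered.

```sql
-- Hypothetical sample data; @scores, grp and val are illustrative names.
DECLARE @scores TABLE (grp INT, val INT);
INSERT INTO @scores VALUES (1, 10), (1, 10), (1, 20), (2, 5), (2, 7), (2, 9);

SELECT grp, val,
       -- RANK() - 1 counts rows strictly below the current value;
       -- COUNT(*) over (grp, val) counts the rows tied with it.
       cd = 1.0 * (RANK() OVER (PARTITION BY grp ORDER BY val) - 1
                   + COUNT(*) OVER (PARTITION BY grp, val))
            / COUNT(*) OVER (PARTITION BY grp)
FROM @scores
ORDER BY grp, val;
```

In group 1 the two tied rows (val = 10) both get 2/3 and the last row gets 1, which is how ties are expected to behave under that definition.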
answered 2012-05-16T02:14:12.183
1
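There is no equivalent function before 2012, but one possible workaround involves a recursive CTE, at least for datasets of fewer than 32,767 rows. Here, a pair of dice is rolled 30 times: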
SET NOCOUNT ON;
DECLARE @t TABLE(i INT);
DECLARE @i INT=0;
WHILE @i<30 BEGIN
INSERT INTO @t VALUES (CAST(RAND()*6 AS INT)+1 + CAST(RAND()*6 AS INT)+1);
SET @i+=1;
END
DECLARE @tc INT; SELECT @tc=COUNT(*) FROM @t;
WITH a AS (
SELECT *
, d=CAST(COUNT(1)OVER(PARTITION BY i) AS DECIMAL(5,2)) / @tc -- no ORDER BY: SQL Server 2008 rejects it for windowed aggregates
, r=ROW_NUMBER()OVER(ORDER BY i)
, pr=CAST((RANK()OVER(ORDER BY i)-1)AS DECIMAL(5,2)) / (@tc - 1)
FROM @t
)
, rcte (i, d, r, cd, pr) AS (
SELECT i, d, r, d, pr
FROM a
WHERE r=1
UNION ALL
SELECT a.i, a.d, a.r
, CASE WHEN rcte.i<>a.i THEN CAST(rcte.cd+a.d AS DECIMAL(5,2)) ELSE rcte.cd END
, a.pr
FROM a
INNER JOIN rcte ON rcte.r + 1 = a.r
)
SELECT i,cd,pr FROM rcte
OPTION (MAXRECURSION 32767)
Result:
i cd pr
----------- --------------------------------------- ---------------------------------------
2 0.0333333333333 0.0000000000000
3 0.0700000000000 0.0344827586206
4 0.2400000000000 0.0689655172413
4 0.2400000000000 0.0689655172413
4 0.2400000000000 0.0689655172413
4 0.2400000000000 0.0689655172413
4 0.2400000000000 0.0689655172413
5 0.3100000000000 0.2413793103448
5 0.3100000000000 0.2413793103448
6 0.3800000000000 0.3103448275862
6 0.3800000000000 0.3103448275862
7 0.5100000000000 0.3793103448275
7 0.5100000000000 0.3793103448275
7 0.5100000000000 0.3793103448275
7 0.5100000000000 0.3793103448275
8 0.6100000000000 0.5172413793103
8 0.6100000000000 0.5172413793103
8 0.6100000000000 0.5172413793103
9 0.8400000000000 0.6206896551724
9 0.8400000000000 0.6206896551724
9 0.8400000000000 0.6206896551724
9 0.8400000000000 0.6206896551724
9 0.8400000000000 0.6206896551724
9 0.8400000000000 0.6206896551724
9 0.8400000000000 0.6206896551724
10 0.8700000000000 0.8620689655172
11 0.9700000000000 0.8965517241379
11 0.9700000000000 0.8965517241379
11 0.9700000000000 0.8965517241379
12 1.0000000000000 1.0000000000000
Here is the SQL 2012 equivalent of the CTE above:
SELECT *
, cd=CUME_DIST()OVER(ORDER BY i)
, pr=PERCENT_RANK()OVER(ORDER BY i)
FROM @t;
answered 2012-05-16T02:42:10.803
0
This gets pretty close. First, some sample data:
USE tempdb;
GO
CREATE TABLE dbo.DartScores
(
TournamentID INT,
PlayerID INT,
Score INT
);
INSERT dbo.DartScores VALUES
(1, 1, 320),
(1, 2, 340),
(1, 3, 310),
(1, 4, 370),
(2, 1, 310),
(2, 2, 280),
(2, 3, 370),
(2, 4, 370);
Now, the 2012 version of the query:
SELECT TournamentID, PlayerID, Score,
pr = PERCENT_RANK() OVER (PARTITION BY TournamentID ORDER BY Score),
cd = CUME_DIST() OVER (PARTITION BY TournamentID ORDER BY Score)
FROM dbo.DartScores
ORDER BY TournamentID, pr;
Yields this result:
TournamentID PlayerID Score pr cd
1 3 310 0 0.25
1 1 320 0.333333333333333 0.5
1 2 340 0.666666666666667 0.75
1 4 370 1 1
2 2 280 0 0.25
2 1 310 0.333333333333333 0.5
2 3 370 0.666666666666667 1
2 4 370 0.666666666666667 1
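A 2005 equivalent gets close, but it doesn't handle ties very well. Sorry, I'm out of gas tonight, otherwise I'd help figure out why. I know about as much as I learned from Itzik's new high-performance window functions book.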
;WITH cte AS
(
SELECT TournamentID, PlayerID, Score,
rk = RANK() OVER (PARTITION BY TournamentID ORDER BY Score),
rn = COUNT(*) OVER (PARTITION BY TournamentID)
FROM dbo.DartScores
)
SELECT TournamentID, PlayerID, Score,
pr = 1e0*(rk-1)/(rn-1),
cd = 1e0*(SELECT COALESCE(MIN(cte2.rk)-1, cte.rn)
FROM cte AS cte2 WHERE cte2.rk > cte.rk) / rn
FROM cte;
Yields this result (notice how the cume_dist values differ slightly for ties):
TournamentID PlayerID Score pr cd
1 3 310 0 0.25
1 1 320 0.333333333333333 0.5
1 2 340 0.666666666666667 0.75
1 4 370 1 1
2 2 280 0 0.25
2 1 310 0.333333333333333 0.5
2 3 370 0.666666666666667 0.75
2 4 370 0.666666666666667 0.75
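One likely cause of the tie discrepancy: the correlated subquery compares ranks across all tournaments, so the tied 370 rows in tournament 2 pick up rank 4 from tournament 1. The sketch below restricts the subquery to the current tournament. It is an editorial suggestion rather than part of the original answer, and it uses a table variable (@DartScores, an illustrative name) with the same rows so that it runs on its own.

```sql
-- Same sample rows as dbo.DartScores, held in a table variable
-- (@DartScores is an illustrative name) so the sketch is self-contained.
DECLARE @DartScores TABLE (TournamentID INT, PlayerID INT, Score INT);
INSERT @DartScores VALUES
(1, 1, 320), (1, 2, 340), (1, 3, 310), (1, 4, 370),
(2, 1, 310), (2, 2, 280), (2, 3, 370), (2, 4, 370);

WITH cte AS
(
  SELECT TournamentID, PlayerID, Score,
    rk = RANK() OVER (PARTITION BY TournamentID ORDER BY Score),
    rn = COUNT(*) OVER (PARTITION BY TournamentID)
  FROM @DartScores
)
SELECT TournamentID, PlayerID, Score,
  pr = 1e0*(rk-1)/(rn-1),
  cd = 1e0*(SELECT COALESCE(MIN(cte2.rk)-1, cte.rn)
       FROM cte AS cte2
       WHERE cte2.TournamentID = cte.TournamentID  -- added: stay inside the partition
         AND cte2.rk > cte.rk) / rn
FROM cte
ORDER BY TournamentID, pr;
```

Tracing the sample data by hand, no row in tournament 2 outranks the two 370 scores, so COALESCE falls back to rn and both rows get cd = 1, matching the CUME_DIST() output. The pr expression still divides by zero for a tournament with a single row.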
Don't forget to clean up:
DROP TABLE dbo.DartScores;
answered 2012-05-16T02:15:56.457
0
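Yes, there is a simple solution, at least for the percent_rank() part. You can use
1.0*(rank() over (partition by <x> order by <y>)-1)/(count(*) over (partition by <x>)-1)
which gives the same result (for partitions with more than one row; the 1.0* avoids integer division) as
percent_rank() over (partition by <x> order by <y>)
The rank() function is one of the few analytic functions that already exist in SQL Server 2008.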
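As a minimal sketch of that expression in use, the query below applies it to made-up data and guards against single-row partitions, where the denominator would be zero. The names @scores, grp and val are hypothetical, and returning 0 for a single-row partition is an assumption chosen to mirror what the built-in function is understood to do.

```sql
-- Hypothetical sample data; @scores, grp and val are illustrative names.
DECLARE @scores TABLE (grp INT, val INT);
INSERT INTO @scores VALUES (1, 10), (1, 20), (1, 20), (1, 40), (2, 5);

WITH r AS (
    SELECT grp, val,
           rk = RANK() OVER (PARTITION BY grp ORDER BY val),
           n  = COUNT(*) OVER (PARTITION BY grp)
    FROM @scores
)
SELECT grp, val,
       pr = CASE WHEN n = 1 THEN 0e0   -- single-row partition: avoid dividing by zero
                 ELSE 1e0 * (rk - 1) / (n - 1) END
FROM r
ORDER BY grp, val;
```

For group 1 this gives 0, 1/3, 1/3 and 1; the single row in group 2 gets 0.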
answered 2015-09-30T15:27:34.810