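According to MSDN, Median is not available as an aggregate function in Transact-SQL. However, I would like to find out whether it is possible to create this functionality (using the CREATE AGGREGATE function, a user-defined function, or some other method).
What would be the best way (if possible) to do this, so that a median value (assuming a numeric data type) can be calculated in an aggregate query?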
If you're using SQL 2005 or better, this is a nice, simple-ish median calculation for a single column in a table:
SELECT
(
(SELECT MAX(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score) AS BottomHalf)
+
(SELECT MIN(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts ORDER BY Score DESC) AS TopHalf)
) / 2 AS Median
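A minimal Python sketch of what the query above computes, assuming `TOP 50 PERCENT` rounds a fractional row count up to the next integer. The function name `median_top_halves` is hypothetical and only models the logic; it is not SQL Server code.

```python
import math

def median_top_halves(scores):
    """Model of the TOP 50 PERCENT query: average the largest value of the
    bottom half and the smallest value of the top half."""
    ordered = sorted(scores)
    half = math.ceil(len(ordered) / 2)      # TOP 50 PERCENT rounds up
    bottom_half = ordered[:half]            # ORDER BY Score
    top_half = ordered[::-1][:half]         # ORDER BY Score DESC
    return (max(bottom_half) + min(top_half)) / 2

print(median_top_halves([1, 3, 5, 7, 9]))   # 5.0: both halves share the middle row
print(median_top_halves([1, 3, 5, 8]))      # 4.0: average of the two middle rows
```

Note that Python's `/` is true division, whereas in T-SQL `/ 2` on an integer column is integer division, so dividing by `2.0` may be needed there to keep a fractional median.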
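2019 UPDATE: In the 10 years since I wrote this answer, more solutions have been uncovered that may yield better results. Also, SQL Server releases since then (especially SQL 2012) have introduced new T-SQL features that can be used to calculate medians. SQL Server releases have also improved the query optimizer, which may affect the performance of various median solutions. Net-net, my original 2009 post is still OK, but there may be better solutions for modern SQL Server apps. Take a look at this article from 2012, which is a great resource: https://sqlperformance.com/2012/08/t-sql-queries/median
This article found the following pattern to be much, much faster than all other alternatives, at least on the simple schema they tested. This solution was 373x faster (!!!) than the slowest (PERCENTILE_CONT) solution tested. Note that this trick requires two separate queries, which may not be practical in all cases. It also requires SQL 2012 or later.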
DECLARE @c BIGINT = (SELECT COUNT(*) FROM dbo.EvenRows);
SELECT AVG(1.0 * val)
FROM (
SELECT val FROM dbo.EvenRows
ORDER BY val
OFFSET (@c - 1) / 2 ROWS
FETCH NEXT 1 + (1 - @c % 2) ROWS ONLY
) AS x;
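Of course, just because one test on one schema in 2012 yielded great results, your mileage may vary, especially if you're on SQL Server 2014 or later. If performance is important for your median calculation, I'd strongly suggest trying and perf-testing several of the options recommended in that article to make sure that you've found the best one for your schema.
I'd also be especially careful using the (new in SQL Server 2012) function PERCENTILE_CONT that's recommended in one of the other answers to this question, because the article linked above found this built-in function to be 373x slower than the fastest solution. It's possible that this disparity has been improved in the 7 years since, but personally I wouldn't use this function on a large table until I'd verified its performance against other solutions.
ORIGINAL 2009 POST IS BELOW:
There are lots of ways to do this, with dramatically varying performance. Here's one particularly well-optimized solution, from Medians, ROW_NUMBERs, and performance. It is particularly efficient in terms of the actual I/Os generated during execution: it looks more costly than other solutions, but it is actually much faster.
That page also contains a discussion of other solutions and performance testing details. Note the use of a unique column as a disambiguator in case there are multiple rows with the same value of the median column.
As with all database performance scenarios, always try to test a solution out with real data on real hardware: you never know when a change to SQL Server's optimizer or a peculiarity in your environment will make a normally speedy solution slower.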
SELECT
CustomerId,
AVG(TotalDue)
FROM
(
SELECT
CustomerId,
TotalDue,
-- SalesOrderId in the ORDER BY is a disambiguator to break ties
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue ASC, SalesOrderId ASC) AS RowAsc,
ROW_NUMBER() OVER (
PARTITION BY CustomerId
ORDER BY TotalDue DESC, SalesOrderId DESC) AS RowDesc
FROM Sales.SalesOrderHeader SOH
) x
WHERE
RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)
GROUP BY CustomerId
ORDER BY CustomerId;
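To see why the filter `RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)` keeps exactly the middle row(s), here is a small Python sketch for a single customer. It assumes each row is a `(total_due, order_id)` pair with a unique `order_id` acting as the disambiguator; the function name `median_by_row_numbers` is hypothetical.

```python
def median_by_row_numbers(rows):
    """rows: list of (total_due, order_id) pairs with unique order_id."""
    asc = sorted(rows)                     # ORDER BY TotalDue ASC, SalesOrderId ASC
    desc = sorted(rows, reverse=True)      # ORDER BY TotalDue DESC, SalesOrderId DESC
    row_asc = {row: i + 1 for i, row in enumerate(asc)}
    row_desc = {row: i + 1 for i, row in enumerate(desc)}
    # Keep rows whose two row numbers are equal or differ by one.
    middle = [row[0] for row in rows
              if row_asc[row] in (row_desc[row], row_desc[row] - 1, row_desc[row] + 1)]
    return sum(middle) / len(middle)       # AVG(TotalDue)

print(median_by_row_numbers([(10, 1), (20, 2), (30, 3), (40, 4)]))        # 25.0
print(median_by_row_numbers([(5, 1), (5, 2), (5, 3), (9, 4), (1, 5)]))    # 5.0
```

Because the tiebreaker makes the descending order the exact reverse of the ascending one, the two numbers always sum to the row count plus one, and only the middle row (odd count) or the two middle rows (even count) pass the filter.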
In SQL Server 2012 you should use PERCENTILE_CONT:
SELECT SalesOrderID, OrderQty,
PERCENTILE_CONT(0.5)
WITHIN GROUP (ORDER BY OrderQty)
OVER (PARTITION BY SalesOrderID) AS MedianCont
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
My original quick answer was:
select max(my_column) as [my_column], quartile
from (select my_column, ntile(4) over (order by my_column) as [quartile]
from my_table) i
--where quartile = 2
group by quartile
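This will give you the median and interquartile range in one fell swoop. If you really only want the one row that is the median, uncomment the where clause.
When you stick that into an explain plan, 60% of the work is sorting the data, which is unavoidable when calculating position-dependent statistics like this.
I've amended the answer to follow the excellent suggestion from Robert Ševčík-Robajz in the comments below: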
;with PartitionedData as
(select my_column, ntile(10) over (order by my_column) as [percentile]
from my_table),
MinimaAndMaxima as
(select min(my_column) as [low], max(my_column) as [high], percentile
from PartitionedData
group by percentile)
select
case
when b.percentile = 10 then cast(b.high as decimal(18,2))
else cast((a.low + b.high) as decimal(18,2)) / 2
end as [value], --b.high, a.low,
b.percentile
from MinimaAndMaxima a
join MinimaAndMaxima b on (a.percentile -1 = b.percentile) or (a.percentile = 10 and b.percentile = 10)
--where b.percentile = 5
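This should calculate the correct median and percentile values when you have an even number of data items. Again, uncomment the final where clause if you only want the median and not the entire percentile distribution.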
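As a rough illustration of why taking only the maximum of a tile can miss the median for an even row count, the following Python sketch models the NTILE bucket-size rule (the first `n % buckets` tiles receive one extra row). The helper names `ntile` and `max_per_tile` are hypothetical and exist only for this example.

```python
def ntile(n_rows, buckets):
    """Tile number (1-based) for each row position in sorted order."""
    base, extra = divmod(n_rows, buckets)
    tiles = []
    for tile in range(1, buckets + 1):
        size = base + (1 if tile <= extra else 0)   # earlier tiles are larger
        tiles.extend([tile] * size)
    return tiles

def max_per_tile(values, buckets):
    """Model of: SELECT MAX(col), tile ... GROUP BY tile."""
    ordered = sorted(values)
    result = {}
    for value, tile in zip(ordered, ntile(len(ordered), buckets)):
        result[tile] = value        # values ascend, so the last one seen is the max
    return result

print(ntile(10, 4))                      # [1, 1, 1, 2, 2, 2, 3, 3, 4, 4]
print(max_per_tile(range(1, 9), 4))      # {1: 2, 2: 4, 3: 6, 4: 8}
```

For the eight values 1 to 8, the maximum of quartile 2 is 4, while the median is 4.5; averaging the top of one tile with the bottom of the next, as the amended query does, closes that gap.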
Even better:
SELECT @Median = AVG(1.0 * val)
FROM
(
SELECT o.val, rn = ROW_NUMBER() OVER (ORDER BY o.val), c.c
FROM dbo.EvenRows AS o
CROSS JOIN (SELECT c = COUNT(*) FROM dbo.EvenRows) AS c
) AS x
WHERE rn IN ((c + 1)/2, (c + 2)/2);
From the master himself, Itzik Ben-Gan!
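The filter `rn IN ((c + 1)/2, (c + 2)/2)` relies on integer division. A short Python sketch of that arithmetic, with hypothetical helper names `middle_positions` and `median`:

```python
def middle_positions(count):
    """1-based row numbers kept by: rn IN ((c + 1)/2, (c + 2)/2)."""
    return sorted({(count + 1) // 2, (count + 2) // 2})

def median(values):
    ordered = sorted(values)
    picked = [ordered[p - 1] for p in middle_positions(len(ordered))]
    return sum(picked) / len(picked)     # AVG(1.0 * val)

for c in (1, 5, 6):
    print(c, middle_positions(c))        # 1 [1], 5 [3], 6 [3, 4]
print(median([4, 1, 3, 2]))              # 2.5
```

For an odd count both expressions collapse to the same row number, so a single row is averaged; for an even count they name the two middle rows.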
MS SQL Server 2012 (and later) has the PERCENTILE_DISC function, which computes a specific percentile for sorted values. PERCENTILE_DISC (0.5) will compute the median - https://msdn.microsoft.com/en-us/library/hh231327.aspx
Simple, fast, accurate
SELECT x.Amount
FROM (SELECT amount,
Count(1) OVER (partition BY 'A') AS TotalRows,
Row_number() OVER (ORDER BY Amount ASC) AS AmountOrder
FROM facttransaction ft) x
WHERE x.AmountOrder = Round(x.TotalRows / 2.0, 0)
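If you want to use the CREATE AGGREGATE function in SQL Server, this is how to do it. Doing it this way has the benefit of being able to write clean queries. Note that this process could be adapted to calculate a percentile value fairly easily.
Create a new Visual Studio project and set the target framework to .NET 3.5 (this is for SQL 2008; it may be different in SQL 2012). Then create a class file and put in the following code, or the C# equivalent: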
Imports Microsoft.SqlServer.Server
Imports System.Data.SqlTypes
Imports System.IO
<Serializable>
<SqlUserDefinedAggregate(Format.UserDefined, IsInvariantToNulls:=True, IsInvariantToDuplicates:=False, _
IsInvariantToOrder:=True, MaxByteSize:=-1, IsNullIfEmpty:=True)>
Public Class Median
Implements IBinarySerialize
Private _items As List(Of Decimal)
Public Sub Init()
_items = New List(Of Decimal)()
End Sub
Public Sub Accumulate(value As SqlDecimal)
If Not value.IsNull Then
_items.Add(value.Value)
End If
End Sub
Public Sub Merge(other As Median)
If other._items IsNot Nothing Then
_items.AddRange(other._items)
End If
End Sub
Public Function Terminate() As SqlDecimal
If _items.Count <> 0 Then
Dim result As Decimal
_items = _items.OrderBy(Function(i) i).ToList()
If _items.Count Mod 2 = 0 Then
result = ((_items((_items.Count / 2) - 1)) + (_items(_items.Count / 2))) / 2@
Else
result = _items((_items.Count - 1) / 2)
End If
Return New SqlDecimal(result)
Else
Return New SqlDecimal()
End If
End Function
Public Sub Read(r As BinaryReader) Implements IBinarySerialize.Read
'deserialize it from a string
Dim list = r.ReadString()
_items = New List(Of Decimal)
For Each value In list.Split(","c)
Dim number As Decimal
If Decimal.TryParse(value, number) Then
_items.Add(number)
End If
Next
End Sub
Public Sub Write(w As BinaryWriter) Implements IBinarySerialize.Write
'serialize the list to a string
Dim list = ""
For Each item In _items
If list <> "" Then
list += ","
End If
list += item.ToString()
Next
w.Write(list)
End Sub
End Class
Then compile it, copy the DLL and PDB files to your SQL Server machine, and run the following command in SQL Server:
CREATE ASSEMBLY CustomAggregate FROM '{path to your DLL}'
WITH PERMISSION_SET=SAFE;
GO
CREATE AGGREGATE Median(@value decimal(9, 3))
RETURNS decimal(9, 3)
EXTERNAL NAME [CustomAggregate].[{namespace of your DLL}.Median];
GO
You can then write a query to calculate the median like this: SELECT dbo.Median(Field) FROM Table
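For readers unfamiliar with the aggregate contract used by the class above, here is a rough Python model of its four steps (Init, Accumulate, Merge, Terminate). The class name `MedianAggregate` and its method names are hypothetical; this is a sketch of the life cycle, not of the SQL CLR API.

```python
from decimal import Decimal

class MedianAggregate:
    """Python model of the Init / Accumulate / Merge / Terminate life cycle."""

    def __init__(self):
        self.init()

    def init(self):
        self._items = []

    def accumulate(self, value):
        if value is not None:                 # NULL inputs are ignored
            self._items.append(Decimal(value))

    def merge(self, other):
        self._items.extend(other._items)      # combine a partial result

    def terminate(self):
        if not self._items:
            return None                       # NULL for an empty group
        ordered = sorted(self._items)
        mid = len(ordered) // 2
        if len(ordered) % 2 == 0:
            return (ordered[mid - 1] + ordered[mid]) / 2
        return ordered[mid]

# Two partial aggregates, as the engine might build for a parallel plan.
part_a, part_b = MedianAggregate(), MedianAggregate()
for v in (1, None, 7):
    part_a.accumulate(v)
for v in (3, 5):
    part_b.accumulate(v)
part_a.merge(part_b)
print(part_a.terminate())                     # 4
```

The sort happens only in the final step, which is why every value has to be kept (and serialized) until then.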
In a UDF, write:
Select Top 1 medianSortColumn from Table T
Where (Select Count(*) from Table
Where MedianSortColumn <
(Select Count(*) From Table) / 2)
Order By medianSortColumn
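Although Justin Grant's solution appears solid, I found that when you have a number of duplicate values within a given partition key, the row numbers for the ASC duplicate values end up out of sequence, so they do not align properly.
Here is a fragment from my result: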
KEY VALUE ROWA ROWD
13 2 22 182
13 1 6 183
13 1 7 184
13 1 8 185
13 1 9 186
13 1 10 187
13 1 11 188
13 1 12 189
13 0 1 190
13 0 2 191
13 0 3 192
13 0 4 193
13 0 5 194
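I used Justin's code as the basis for this solution. Although it is not as efficient, given the use of multiple derived tables, it does resolve the row ordering problem I encountered. Any improvements would be welcome, as I am not that experienced in T-SQL.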
SELECT PKEY, cast(AVG(VALUE)as decimal(5,2)) as MEDIANVALUE
FROM
(
SELECT PKEY,VALUE,ROWA,ROWD,
'FLAG' = (CASE WHEN ROWA IN (ROWD,ROWD-1,ROWD+1) THEN 1 ELSE 0 END)
FROM
(
SELECT
PKEY,
cast(VALUE as decimal(5,2)) as VALUE,
ROWA,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY ROWA DESC) as ROWD
FROM
(
SELECT
PKEY,
VALUE,
ROW_NUMBER() OVER (PARTITION BY PKEY ORDER BY VALUE ASC,PKEY ASC ) as ROWA
FROM [MTEST]
)T1
)T2
)T3
WHERE FLAG = '1'
GROUP BY PKEY
ORDER BY PKEY
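The misalignment described above can be reproduced on paper. The Python sketch below uses made-up row labels and numberings (they are illustrative assumptions, not output from SQL Server) to compare a descending numbering whose ties were broken independently with one derived from the ascending numbering, which is what ordering by ROWA DESC achieves.

```python
def selected_rows(row_asc, row_desc):
    """Rows kept by: RowAsc IN (RowDesc, RowDesc - 1, RowDesc + 1)."""
    return [key for key in row_asc
            if row_asc[key] in (row_desc[key], row_desc[key] - 1, row_desc[key] + 1)]

# Four rows with values a=1, b=1, c=5, d=5 (true median 3).
row_asc = {"a": 1, "b": 2, "c": 3, "d": 4}

# Ties broken independently in the descending numbering: c before d, a before b.
independent_desc = {"c": 1, "d": 2, "a": 3, "b": 4}

# Descending number derived from the ascending one: n + 1 - ROWA.
derived_desc = {key: len(row_asc) + 1 - rn for key, rn in row_asc.items()}

print(selected_rows(row_asc, independent_desc))   # []
print(selected_rows(row_asc, derived_desc))       # ['b', 'c']
```

With the independent numbering no row passes the filter, so the group silently disappears from the result; with the derived numbering the two middle rows are found and their average is 3.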
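I just came across this page while looking for a set-based solution to median. After looking at some of the solutions here, I came up with the following. I hope it helps/works.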
DECLARE @test TABLE(
i int identity(1,1),
id int,
score float
)
INSERT INTO @test (id,score) VALUES (1,10)
INSERT INTO @test (id,score) VALUES (1,11)
INSERT INTO @test (id,score) VALUES (1,15)
INSERT INTO @test (id,score) VALUES (1,19)
INSERT INTO @test (id,score) VALUES (1,20)
INSERT INTO @test (id,score) VALUES (2,20)
INSERT INTO @test (id,score) VALUES (2,21)
INSERT INTO @test (id,score) VALUES (2,25)
INSERT INTO @test (id,score) VALUES (2,29)
INSERT INTO @test (id,score) VALUES (2,30)
INSERT INTO @test (id,score) VALUES (3,20)
INSERT INTO @test (id,score) VALUES (3,21)
INSERT INTO @test (id,score) VALUES (3,25)
INSERT INTO @test (id,score) VALUES (3,29)
DECLARE @counts TABLE(
id int,
cnt int
)
INSERT INTO @counts (
id,
cnt
)
SELECT
id,
COUNT(*)
FROM
@test
GROUP BY
id
SELECT
drv.id,
drv.start,
AVG(t.score)
FROM
(
SELECT
MIN(t.i)-1 AS start,
t.id
FROM
@test t
GROUP BY
t.id
) drv
INNER JOIN @test t ON drv.id = t.id
INNER JOIN @counts c ON t.id = c.id
WHERE
t.i = ((c.cnt+1)/2)+drv.start
OR (
t.i = (((c.cnt+1)%2) * ((c.cnt+2)/2))+drv.start
AND ((c.cnt+1)%2) * ((c.cnt+2)/2) <> 0
)
GROUP BY
drv.id,
drv.start
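The following query returns the median from a list of values in one column. It cannot be used as or along with an aggregate function, but you can still use it as a sub-query with a WHERE clause in the inner select.
SQL Server 2005+: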
SELECT TOP 1 value from
(
SELECT TOP 50 PERCENT value
FROM table_name
ORDER BY value
)for_median
ORDER BY value DESC
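Justin's example above is very good. But the need for a primary key should be stated very clearly. I have seen that code in the wild without the key, and the results are bad.
The complaint I have about PERCENTILE_CONT is that it won't give you an actual value from the dataset. To get a "median" that is an actual value from the dataset, use PERCENTILE_DISC.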
SELECT SalesOrderID, OrderQty,
PERCENTILE_DISC(0.5)
WITHIN GROUP (ORDER BY OrderQty)
OVER (PARTITION BY SalesOrderID) AS MedianDisc
FROM Sales.SalesOrderDetail
WHERE SalesOrderID IN (43670, 43669, 43667, 43663)
ORDER BY SalesOrderID DESC
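The difference between the two functions can be sketched in Python. This assumes the standard definitions: the continuous percentile interpolates at position 1 + p * (n - 1), and the discrete percentile returns the first sorted value whose cumulative distribution reaches p. The function names are hypothetical stand-ins, not SQL Server code.

```python
import math

def percentile_cont(values, p):
    """Interpolated percentile; the result may not be a value in the data."""
    ordered = sorted(values)
    pos = p * (len(ordered) - 1)            # 0-based fractional position
    lower, upper = math.floor(pos), math.ceil(pos)
    fraction = pos - lower
    return ordered[lower] + (ordered[upper] - ordered[lower]) * fraction

def percentile_disc(values, p):
    """First sorted value whose cumulative distribution is >= p."""
    ordered = sorted(values)
    n = len(ordered)
    for i, value in enumerate(ordered, start=1):
        if i / n >= p:
            return value

quantities = [1, 2, 3, 4]
print(percentile_cont(quantities, 0.5))     # 2.5, not present in the data
print(percentile_disc(quantities, 0.5))     # 2, an actual data value
```

For an odd number of rows both return the same middle value; they only differ when the count is even.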
Median Finding
This is the simplest method to find the median of an attribute.
Select round(S.salary,4) median from employee S
where (select count(salary) from employee
where salary < S.salary ) = (select count(salary) from employee
where salary > S.salary)
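A quick Python model of the condition used above (as many strictly smaller values as strictly larger ones) shows when it returns a row. The function name `median_by_counts` is hypothetical.

```python
def median_by_counts(values):
    """Values with as many strictly smaller values as strictly larger ones."""
    return [v for v in values
            if sum(1 for x in values if x < v) == sum(1 for x in values if x > v)]

print(median_by_counts([10, 20, 30]))        # [20]
print(median_by_counts([10, 20, 30, 40]))    # []
print(median_by_counts([10, 20, 20, 30]))    # [20, 20]
```

So this approach fits data with an odd number of rows (or tied middle values); with an even number of rows and two different middle values, no row satisfies the condition.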
Using a single statement: one way is to use the ROW_NUMBER() and COUNT() window functions and filter the sub-query. Here is how to find the median salary:
SELECT AVG(e_salary)
FROM
(SELECT
ROW_NUMBER() OVER(ORDER BY e_salary) as row_no,
e_salary,
(COUNT(*) OVER()+1)*0.5 AS row_half
FROM Employee) t
WHERE row_no IN (FLOOR(row_half),CEILING(row_half))
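I have seen similar solutions over the net using FLOOR and CEILING, but tried to use a single statement. (edited)
See other solutions for median calculation in SQL here: "Simple way to calculate median with MySQL" (the solutions are mostly vendor-independent).
For large-scale datasets, you can try this GIST: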
https://gist.github.com/chrisknoll/1b38761ce8c5016ec5b2
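It works by aggregating the distinct values you would find in your set (such as ages, or year of birth, etc.), and uses SQL window functions to locate any percentile position you specify in the query.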
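The general idea described here (not the gist's actual code, which is not reproduced in this thread) can be sketched in Python: collapse the data to (value, count) pairs, then walk a running total until it reaches the target rank. The function name `percentile_from_counts` is hypothetical, and the sketch returns a discrete percentile without interpolation.

```python
import math
from collections import Counter

def percentile_from_counts(values, p):
    """Locate a percentile from (value, count) pairs via running totals."""
    counts = sorted(Counter(values).items())     # GROUP BY value ORDER BY value
    total = sum(count for _, count in counts)
    target = max(1, math.ceil(p * total))        # 1-based rank to locate
    running = 0
    for value, count in counts:
        running += count                         # running SUM(count) in value order
        if running >= target:
            return value

ages = [30, 30, 31, 35, 35, 35, 40]
print(percentile_from_counts(ages, 0.5))         # 35
```

Only the distinct values need to be sorted, which is what makes this attractive when millions of rows share a few hundred distinct values.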
Building on Jeff Atwood's answer above, here it is with GROUP BY and a correlated subquery to get the median for each group.
SELECT TestID,
(
(SELECT MAX(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score) AS BottomHalf)
+
(SELECT MIN(Score) FROM
(SELECT TOP 50 PERCENT Score FROM Posts WHERE TestID = Posts_parent.TestID ORDER BY Score DESC) AS TopHalf)
) / 2 AS MedianScore,
AVG(Score) AS AvgScore, MIN(Score) AS MinScore, MAX(Score) AS MaxScore
FROM Posts_parent
GROUP BY Posts_parent.TestID
For a continuous variable/measure 'col1' from 'table1':
select col1
from
(select top 50 percent col1,
ROW_NUMBER() OVER(ORDER BY col1 ASC) AS Rowa,
ROW_NUMBER() OVER(ORDER BY col1 DESC) AS Rowd
from table1 ) tmp
where tmp.Rowa = tmp.Rowd
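Frequently, we may need to calculate the median not just for the whole table, but for aggregates with respect to some ID. In other words, calculate the median for each ID in our table, where each ID has many records. (Based on the solution edited by @gdoron: good performance and works in many SQL dialects.)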
SELECT our_id, AVG(1.0 * our_val) as Median
FROM
( SELECT our_id, our_val,
COUNT(*) OVER (PARTITION BY our_id) AS cnt,
ROW_NUMBER() OVER (PARTITION BY our_id ORDER BY our_val) AS rnk
FROM our_table
) AS x
WHERE rnk IN ((cnt + 1)/2, (cnt + 2)/2) GROUP BY our_id;
Hope it helps.
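A hedged Python sketch of the per-ID query above: rows are partitioned by ID, sorted within each partition, and the middle row(s) averaged. The function name `median_per_id` and the sample rows are hypothetical.

```python
from collections import defaultdict

def median_per_id(rows):
    """rows: iterable of (our_id, our_val); returns {our_id: median}."""
    groups = defaultdict(list)
    for our_id, our_val in rows:                 # PARTITION BY our_id
        groups[our_id].append(our_val)
    medians = {}
    for our_id, vals in groups.items():
        vals.sort()                              # ORDER BY our_val
        cnt = len(vals)
        middle = {(cnt + 1) // 2, (cnt + 2) // 2}    # rnk IN (...)
        picked = [vals[rnk - 1] for rnk in middle]
        medians[our_id] = sum(picked) / len(picked)  # AVG(1.0 * our_val)
    return medians

rows = [(1, 10), (1, 11), (1, 15), (1, 19), (1, 20),
        (2, 20), (2, 21), (2, 25), (2, 29)]
print(median_per_id(rows))                       # {1: 15.0, 2: 23.0}
```

Each ID gets its own count, so groups with odd and even sizes are handled independently in one pass.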
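I wanted to work out a solution by myself, but my brain tripped and fell on the way. I think it works, but don't ask me to explain it in the morning. :P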
DECLARE @table AS TABLE
(
Number int not null
);
insert into @table select 2;
insert into @table select 4;
insert into @table select 9;
insert into @table select 15;
insert into @table select 22;
insert into @table select 26;
insert into @table select 37;
insert into @table select 49;
DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;
WITH MyResults(RowNo, Number) AS
(
SELECT RowNo, Number FROM
(SELECT ROW_NUMBER() OVER (ORDER BY Number) AS RowNo, Number FROM @table) AS Foo
)
SELECT AVG(Number) FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)
--Create Temp Table to Store Results in
DECLARE @results AS TABLE
(
[Month] datetime not null
,[Median] int not null
);
--This variable will determine the date
DECLARE @IntDate as int
set @IntDate = -13
WHILE (@IntDate < 0)
BEGIN
--Create Temp Table
DECLARE @table AS TABLE
(
[Rank] int not null
,[Days Open] int not null
);
--Insert records into Temp Table
insert into @table
SELECT
rank() OVER (ORDER BY DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0), DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')),[SVR].[ref_num]) as [Rank]
,DATEDIFF(day,DATEADD(ss, SVR.open_date, '1970'),DATEADD(ss, SVR.close_date, '1970')) as [Days Open]
FROM
mdbrpt.dbo.View_Request SVR
LEFT OUTER JOIN dbo.dtv_apps_systems vapp
on SVR.category = vapp.persid
LEFT OUTER JOIN dbo.prob_ctg pctg
on SVR.category = pctg.persid
Left Outer Join [mdbrpt].[dbo].[rootcause] as [Root Cause]
on [SVR].[rootcause]=[Root Cause].[id]
Left Outer Join [mdbrpt].[dbo].[cr_stat] as [Status]
on [SVR].[status]=[Status].[code]
LEFT OUTER JOIN [mdbrpt].[dbo].[net_res] as [net]
on [net].[id]=SVR.[affected_rc]
WHERE
SVR.Type IN ('P')
AND
SVR.close_date IS NOT NULL
AND
[Status].[SYM] = 'Closed'
AND
SVR.parent is null
AND
[Root Cause].[sym] in ( 'RC - Application','RC - Hardware', 'RC - Operational', 'RC - Unknown')
AND
(
[vapp].[appl_name] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
OR
pctg.sym in ('Systems.Release Health Dashboard.Problem','DTV QA Test.Enterprise Release.Deferred Defect Log')
AND
[Net].[nr_desc] in ('3PI','Billing Rpts/Files','Collabrent','Reports','STMS','STMS 2','Telco','Comergent','OOM','C3-BAU','C3-DD','DIRECTV','DIRECTV Sales','DIRECTV Self Care','Dealer Website','EI Servlet','Enterprise Integration','ET','ICAN','ODS','SB-SCM','SeeBeyond','Digital Dashboard','IVR','OMS','Order Services','Retail Services','OSCAR','SAP','CTI','RIO','RIO Call Center','RIO Field Services','FSS-RIO3','TAOS','TCS')
)
AND
DATEADD(mm, DATEDIFF(mm, 0, DATEADD(ss, SVR.close_date, '1970')), 0) = DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0)
ORDER BY [Days Open]
DECLARE @Count AS INT
SELECT @Count = COUNT(*) FROM @table;
WITH MyResults(RowNo, [Days Open]) AS
(
SELECT RowNo, [Days Open] FROM
(SELECT ROW_NUMBER() OVER (ORDER BY [Days Open]) AS RowNo, [Days Open] FROM @table) AS Foo
)
insert into @results
SELECT
DATEADD(mm, DATEDIFF(mm,0,DATEADD(mm,@IntDate,getdate())), 0) as [Month]
,AVG([Days Open])as [Median] FROM MyResults WHERE RowNo = (@Count+1)/2 OR RowNo = ((@Count+1)%2) * ((@Count+2)/2)
set @IntDate = @IntDate+1
DELETE FROM @table
END
select *
from @results
order by [Month]
This works with SQL 2000:
DECLARE @testTable TABLE
(
VALUE INT
)
--INSERT INTO @testTable -- Even Test
--SELECT 3 UNION ALL
--SELECT 5 UNION ALL
--SELECT 7 UNION ALL
--SELECT 12 UNION ALL
--SELECT 13 UNION ALL
--SELECT 14 UNION ALL
--SELECT 21 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 29 UNION ALL
--SELECT 40 UNION ALL
--SELECT 56
--
--INSERT INTO @testTable -- Odd Test
--SELECT 3 UNION ALL
--SELECT 5 UNION ALL
--SELECT 7 UNION ALL
--SELECT 12 UNION ALL
--SELECT 13 UNION ALL
--SELECT 14 UNION ALL
--SELECT 21 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 23 UNION ALL
--SELECT 29 UNION ALL
--SELECT 39 UNION ALL
--SELECT 40 UNION ALL
--SELECT 56
DECLARE @RowAsc TABLE
(
ID INT IDENTITY,
Amount INT
)
INSERT INTO @RowAsc
SELECT VALUE
FROM @testTable
ORDER BY VALUE ASC
SELECT AVG(amount)
FROM @RowAsc ra
WHERE ra.id IN
(
SELECT ID
FROM @RowAsc
WHERE ra.id -
(
SELECT MAX(id) / 2.0
FROM @RowAsc
) BETWEEN 0 AND 1
)
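For newbies like myself who are learning the very basics, I personally find this example easier to follow, as it is easier to understand exactly what's happening and where the median values are coming from...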
select
( max(a.[Value1]) + min(a.[Value1]) ) / 2 as [Median Value1]
,( max(a.[Value2]) + min(a.[Value2]) ) / 2 as [Median Value2]
from (select
datediff(dd,startdate,enddate) as [Value1]
,xxxxxxxxxxxxxx as [Value2]
from dbo.table1
)a
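In absolute awe of some of the code above though!!!
This is as simple an answer as I could come up with. It worked well with my data. If you want to exclude certain values, just add a where clause to the inner select.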
SELECT TOP 1
ValueField AS MedianValue
FROM
(SELECT TOP(SELECT COUNT(1)/2 FROM tTABLE)
ValueField
FROM
tTABLE
ORDER BY
ValueField) A
ORDER BY
ValueField DESC
The following solution works under these assumptions:
Code:
IF OBJECT_ID('dbo.R', 'U') IS NOT NULL
DROP TABLE dbo.R
CREATE TABLE R (
A FLOAT NOT NULL);
INSERT INTO R VALUES (1);
INSERT INTO R VALUES (2);
INSERT INTO R VALUES (3);
INSERT INTO R VALUES (4);
INSERT INTO R VALUES (5);
INSERT INTO R VALUES (6);
-- Returns Median(R)
select SUM(A) / CAST(COUNT(A) AS FLOAT)
from R R1
where ((select count(A) from R R2 where R1.A > R2.A) =
(select count(A) from R R2 where R1.A < R2.A)) OR
((select count(A) from R R2 where R1.A > R2.A) + 1 =
(select count(A) from R R2 where R1.A < R2.A)) OR
((select count(A) from R R2 where R1.A > R2.A) =
(select count(A) from R R2 where R1.A < R2.A) + 1) ;
DECLARE @Obs int
DECLARE @RowAsc table
(
ID INT IDENTITY,
Observation FLOAT
)
INSERT INTO @RowAsc
SELECT Observations FROM MyTable
ORDER BY 1
SELECT @Obs=COUNT(*)/2 FROM @RowAsc
SELECT Observation AS Median FROM @RowAsc WHERE ID=@Obs
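I tried several alternatives, but because my data records have repeated values, the ROW_NUMBER versions do not seem to be a choice for me. So here is the query I used (a version with NTILE):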
SELECT distinct
CustomerId,
(
MAX(CASE WHEN Percent50_Asc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId) +
MIN(CASE WHEN Percent50_desc=1 THEN TotalDue END) OVER (PARTITION BY CustomerId)
)/2 MEDIAN
FROM
(
SELECT
CustomerId,
TotalDue,
NTILE(2) OVER (
PARTITION BY CustomerId
ORDER BY TotalDue ASC) AS Percent50_Asc,
NTILE(2) OVER (
PARTITION BY CustomerId
ORDER BY TotalDue DESC) AS Percent50_desc
FROM Sales.SalesOrderHeader SOH
) x
ORDER BY CustomerId;
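For your question, Jeff Atwood has already given a simple and effective solution. But if you are looking for an alternative approach to calculating the median, the SQL code below will help you.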
create table employees(salary int);
insert into employees values(8); insert into employees values(23); insert into employees values(45); insert into employees values(123); insert into employees values(93); insert into employees values(2342); insert into employees values(2238);
select * from employees;
declare @odd_even int; declare @cnt int; declare @middle_no int;
set @cnt=(select count(*) from employees); set @middle_no=(@cnt/2)+1; select @odd_even=case when (@cnt%2=0) THEN -1 ELse 0 END ;
select AVG(tbl.salary) from (select salary,ROW_NUMBER() over (order by salary) as rno from employees group by salary) tbl where tbl.rno=@middle_no or tbl.rno=@middle_no+@odd_even;
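If you are looking to calculate the median in MySQL, this GitHub link will be useful.
This is the most optimal solution for finding medians that I can think of. The names in the example are based on Justin's example. Make sure an index exists on table Sales.SalesOrderHeader with the index columns CustomerId and TotalDue, in that order.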
SELECT
sohCount.CustomerId,
AVG(sohMid.TotalDue) as TotalDueMedian
FROM
(SELECT
soh.CustomerId,
COUNT(*) as NumberOfRows
FROM
Sales.SalesOrderHeader soh
GROUP BY soh.CustomerId) As sohCount
CROSS APPLY
(Select
soh.TotalDue
FROM
Sales.SalesOrderHeader soh
WHERE soh.CustomerId = sohCount.CustomerId
ORDER BY soh.TotalDue
OFFSET sohCount.NumberOfRows / 2 - ((sohCount.NumberOfRows + 1) % 2) ROWS
FETCH NEXT 1 + ((sohCount.NumberOfRows + 1) % 2) ROWS ONLY
) As sohMid
GROUP BY sohCount.CustomerId
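UPDATE
I was a bit unsure about which method has the best performance, so I compared my method with Justin Grant's and Jeff Atwood's by running queries based on all three methods in one batch. The batch cost of each query was:
Without index:
And with index:
I tried to see how well the queries scale by creating more data from the original roughly 14,000 rows, multiplying it by factors up to 512, which means around 7.2 million rows in the end. Note that I made sure the CustomerId field was unique each time I made a single copy, so the proportion of rows to unique CustomerId values was kept constant. While doing this I ran executions, rebuilding the index afterwards, and I noticed that the results stabilized at around a factor of 128, with the data I had, at these values:
I wondered how performance would be affected by scaling the number of rows while keeping the number of unique CustomerId values constant, so I set up a new test where I did just that. Now, instead of stabilizing, the batch cost ratio kept diverging; and instead of about 20 rows per CustomerId on average, I ended up with about 10,000 rows per unique ID. The numbers:
I made sure I implemented each method correctly by comparing the results. My conclusion is that the method I used is generally faster as long as the index exists. Also note that this method is the one recommended for this particular problem in this article: https://www.microsoftpressstore.com/articles/article.aspx?p=2314819&seqNum=5
A way to further improve the performance of subsequent calls to this query is to persist the count information in an auxiliary table. You could even maintain it with a trigger that updates and holds the number of SalesOrderHeader rows per CustomerId; of course, you could then simply store the median as well.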
with tempa as
(
select value,row_number() over (order by value) as Rn,/* Assigning a
row_number */
count(value) over () as Cnt /*Taking total count of the values */
from numbers
where value is not null /* Excluding the null values */
),
tempb as
(
/* Since we don't know whether the number of rows is odd or even, we shall
consider both the scenarios */
select round(cnt/2) as Ref from tempa where mod(cnt,2)=1
union all
select round(cnt/2) as Ref from tempa where mod(cnt,2)=0
union all
select round(cnt/2) + 1 as Ref from tempa where mod(cnt,2)=0
)
select avg(value) as Median_Value
from tempa where rn in
( select Ref from tempb);
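Using the COUNT aggregate, you can first count how many rows there are and store the result in a variable called @cnt. Then you can compute the parameters for the OFFSET-FETCH filter to specify, based on quantity ordering, how many rows to skip (the offset value) and how many to filter (the fetch value).
The number of rows to skip is (@cnt - 1) / 2. It is clear that this calculation is correct for an odd count, because you first subtract 1 for the single middle value before dividing by 2.
It also works correctly for an even count, because the division used in the expression is integer division; so, when you subtract 1 from an even count, you are left with an odd value.
When that odd value is divided by 2, the fractional part of the result (.5) is truncated. The number of rows to fetch is 2 - (@cnt % 2). The idea is that when the count is odd, the result of the modulo operation is 1, and you need to fetch 1 row. When the count is even, the result of the modulo operation is 0, and you need to fetch 2 rows. By subtracting the 1 or 0 result of the modulo operation from 2, you get the desired 1 or 2, respectively. Finally, to compute the median, take the one or two result quantities and apply an average after converting the input integer value to a numeric one, as follows: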
DECLARE @cnt AS INT = (SELECT COUNT(*) FROM [Sales].[production].[stocks]);
SELECT AVG(1.0 * quantity) AS median
FROM ( SELECT quantity
FROM [Sales].[production].[stocks]
ORDER BY quantity
OFFSET (@cnt - 1) / 2 ROWS FETCH NEXT 2 - @cnt % 2 ROWS ONLY ) AS D;
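The offset and fetch arithmetic of this query can be checked with a few lines of Python. This is only a model of the calculation, assuming a non-empty input; the function name `offset_fetch_median` is hypothetical.

```python
def offset_fetch_median(quantities):
    ordered = sorted(quantities)            # ORDER BY quantity
    cnt = len(ordered)
    offset = (cnt - 1) // 2                 # OFFSET (@cnt - 1) / 2 ROWS
    fetch = 2 - cnt % 2                     # FETCH NEXT 2 - @cnt % 2 ROWS ONLY
    window = ordered[offset:offset + fetch]
    return sum(window) / len(window)        # AVG(1.0 * quantity)

for cnt in (5, 6):
    print(cnt, (cnt - 1) // 2, 2 - cnt % 2)     # 5 2 1, then 6 2 2
print(offset_fetch_median([1, 2, 3, 4, 5]))     # 3.0
print(offset_fetch_median([1, 2, 3, 4, 5, 6]))  # 3.5
```

For five rows the query skips two and fetches one; for six rows it skips two and fetches two, which are exactly the middle positions.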
To get the median salary from the employees table:
with cte as (select salary, ROW_NUMBER() over (order by salary asc) as num from employees)
select avg(salary) from cte where num in ((select (count(*)+1)/2 from employees), (select (count(*)+2)/2 from employees));
In my solution, the table is a student table with only a marks column, and I am calculating the median of marks. This solution is based on SQL Server 2019.
with total_c as ( --Total_c CTE counts total number of rows in a table
select count(*) as n from student
),
even as ( --Even CTE extract two middle rows if the number of rows are even
select marks from student
order by marks
offset (select n from total_c)/2 -1 rows
fetch next 2 rows only
),
odd as ( --Odd CTE extract middle row if the number of rows are odd
select marks from student
order by marks
offset (select n + 1 from total_c)/2 -1 rows
fetch next 1 rows only
)
--Case statement helps to select odd or even CTE based on number of rows
select
case when n%2 = 0 then (select avg(cast(marks as float)) from even)
else (select marks from odd)
end as med_marks
from total_c
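This code is a bit long, but it is easy to understand.
medi is the table with the column 'val', which holds the data set; smedi is a CTE with the column idx as the row number and vals as 'val' from the medi table, sorted in ascending order. After that it is basic math: if the row count is odd, the median is the middle value from smedi; when it is even, it is the average of the two middle values.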
with smedi(idx,vals) as(
select ROW_NUMBER() over(order by val),val from medi
)
select (case
when (select count(*) from medi)%2!=0 then (select vals from smedi where (((select count(*) from medi)/2)+1)=idx)
else (select avg(vals) from smedi where idx in ((select count(*)/2 from medi),(select (count(*)/2)+1 from medi)))
end)
Try the logic below to find the median:
Consider a table with the following numbers: 1,1,2,3,4,5
The median is 2.5
with tempa as
( select num, count(num) over() as Cnt, row_number() over (order by num) as Rnum from temp),
tempb as
( select round(cnt/2) as ref_value from tempa where mod(cnt,2)<>0
union all
select round(cnt/2) from tempa where mod(cnt,2)=0
union all
select round(cnt/2+1) from tempa where mod(cnt,2)=0 )
select avg(num) from tempa where rnum in (select * from tempb);