SQL Server 2005/2008 中是否有任何线性回归函数,类似于Oracle 中的线性回归函数?
8 回答
据我所知,没有。不过,写一个非常简单。下面给出了 y = Alpha + Beta * x + epsilon 的常数 alpha 和斜率 beta:
-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
( SELECT 1, 1, 1 UNION SELECT 1, 2, 2 UNION SELECT 1, 3, 1.3
UNION SELECT 1, 4, 3.75 UNION SELECT 1, 5, 2.25 UNION SELECT 2, 95, 85
UNION SELECT 2, 85, 95 UNION SELECT 2, 80, 70 UNION SELECT 2, 70, 65
UNION SELECT 2, 60, 70 UNION SELECT 3, 1, 2 UNION SELECT 3, 1, 3
UNION SELECT 4, 1, 2 UNION SELECT 4, 2, 2),
-- linear regression query
/*WITH*/ mean_estimates AS
( SELECT GroupID
,AVG(x * 1.) AS xmean
,AVG(y * 1.) AS ymean
FROM some_table
GROUP BY GroupID
),
stdev_estimates AS
( SELECT pd.GroupID
-- T-SQL STDEV() implementation is not numerically stable
,CASE SUM(SQUARE(x - xmean)) WHEN 0 THEN 1
ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
, SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1)) AS ystdev
FROM some_table pd
INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS -- increases numerical stability
( SELECT pd.GroupID
,(x - xmean) / xstdev AS xstd
,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
FROM some_table pd
INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
( SELECT GroupID
,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END AS betastd
FROM standardized_data pd
GROUP BY GroupID
)
SELECT pb.GroupID
,ymean - xmean * betastd * ystdev / xstdev AS Alpha
,betastd * ystdev / xstdev AS Beta
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pb.GroupID
这里GroupID
用于显示如何按源数据表中的某个值进行分组。如果您只想要表中所有数据的统计信息(不是特定的子组),您可以删除它和连接。WITH
为了清楚起见,我使用了该声明。作为替代方案,您可以改用子查询。请注意表中使用的数据类型的精度,因为如果精度相对于您的数据不够高,数值稳定性会迅速恶化。
编辑:(回答彼得关于评论中 R2 等其他统计数据的问题)
您可以使用相同的技术轻松计算其他统计数据。这是具有 R2、相关性和样本协方差的版本:
-- test data (GroupIDs 1, 2 normal regressions, 3, 4 = no variance)
WITH some_table(GroupID, x, y) AS
( SELECT 1, 1, 1 UNION SELECT 1, 2, 2 UNION SELECT 1, 3, 1.3
UNION SELECT 1, 4, 3.75 UNION SELECT 1, 5, 2.25 UNION SELECT 2, 95, 85
UNION SELECT 2, 85, 95 UNION SELECT 2, 80, 70 UNION SELECT 2, 70, 65
UNION SELECT 2, 60, 70 UNION SELECT 3, 1, 2 UNION SELECT 3, 1, 3
UNION SELECT 4, 1, 2 UNION SELECT 4, 2, 2),
-- linear regression query
/*WITH*/ mean_estimates AS
( SELECT GroupID
,AVG(x * 1.) AS xmean
,AVG(y * 1.) AS ymean
FROM some_table pd
GROUP BY GroupID
),
stdev_estimates AS
( SELECT pd.GroupID
-- T-SQL STDEV() implementation is not numerically stable
,CASE SUM(SQUARE(x - xmean)) WHEN 0 THEN 1
ELSE SQRT(SUM(SQUARE(x - xmean)) / (COUNT(*) - 1)) END AS xstdev
, SQRT(SUM(SQUARE(y - ymean)) / (COUNT(*) - 1)) AS ystdev
FROM some_table pd
INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
GROUP BY pd.GroupID, pm.xmean, pm.ymean
),
standardized_data AS -- increases numerical stability
( SELECT pd.GroupID
,(x - xmean) / xstdev AS xstd
,CASE ystdev WHEN 0 THEN 0 ELSE (y - ymean) / ystdev END AS ystd
FROM some_table pd
INNER JOIN stdev_estimates ps ON ps.GroupID = pd.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pd.GroupID
),
standardized_beta_estimates AS
( SELECT GroupID
,CASE WHEN SUM(xstd * xstd) = 0 THEN 0
ELSE SUM(xstd * ystd) / (COUNT(*) - 1) END AS betastd
FROM standardized_data
GROUP BY GroupID
)
SELECT pb.GroupID
,ymean - xmean * betastd * ystdev / xstdev AS Alpha
,betastd * ystdev / xstdev AS Beta
,CASE ystdev WHEN 0 THEN 1 ELSE betastd * betastd END AS R2
,betastd AS Correl
,betastd * xstdev * ystdev AS Covar
FROM standardized_beta_estimates pb
INNER JOIN stdev_estimates ps ON ps.GroupID = pb.GroupID
INNER JOIN mean_estimates pm ON pm.GroupID = pb.GroupID
EDIT 2STDEV
通过标准化数据(而不仅仅是居中)和由于数值稳定性问题而替换来提高数值稳定性。对我来说,当前的实现似乎是稳定性和复杂性之间的最佳权衡。我可以通过用数值稳定的在线算法替换我的标准偏差来提高稳定性,但这会使实现大大复杂化(并减慢它的速度)。类似地,使用例如 Kahan(-Babuška-Neumaier) 补偿的实现在有限的测试SUM
中AVG
似乎表现得更好,但使查询更加复杂。而且只要我不知道 T-SQL 是如何实现SUM
和AVG
(例如,它可能已经在使用成对求和),我不能保证这样的修改总能提高准确性。
这是一种替代方法,基于关于 T-SQL 中的线性回归的博客文章,它使用以下等式:
博客中的 SQL 建议使用游标。这是我使用的论坛答案的美化版本:
table
-----
X (numeric)
Y (numeric)
/**
* m = (nSxy - SxSy) / (nSxx - SxSx)
* b = Ay - (Ax * m)
* N.B. S = Sum, A = Mean
*/
DECLARE @n INT
SELECT @n = COUNT(*) FROM table
SELECT (@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS M,
AVG(Y) - AVG(X) *
(@n * SUM(X*Y) - SUM(X) * SUM(Y)) / (@n * SUM(X*X) - SUM(X) * SUM(X)) AS B
FROM table
我实际上已经使用 Gram-Schmidt 正交化编写了一个 SQL 例程。它以及其他机器学习和预测例程可在sqldatamine.blogspot.com上获得
在 Brad Larson 的建议下,我在这里添加了代码,而不仅仅是将用户引导到我的博客。这会产生与 Excel 中的 linest 函数相同的结果。我的主要来源是 Hastie、Tibshirni 和 Friedman 的 Elements of Statistical Learning (2008)。
--Create a table of data
create table #rawdata (id int,area float, rooms float, odd float, price float)
insert into #rawdata select 1, 2201,3,1,400
insert into #rawdata select 2, 1600,3,0,330
insert into #rawdata select 3, 2400,3,1,369
insert into #rawdata select 4, 1416,2,1,232
insert into #rawdata select 5, 3000,4,0,540
--Insert the data into x & y vectors
select id xid, 0 xn,1 xv into #x from #rawdata
union all
select id, 1,rooms from #rawdata
union all
select id, 2,area from #rawdata
union all
select id, 3,odd from #rawdata
select id yid, 0 yn, price yv into #y from #rawdata
--create a residuals table and insert the intercept (1)
create table #z (zid int, zn int, zv float)
insert into #z select id , 0 zn,1 zv from #rawdata
--create a table for the orthoganal (#c) & regression(#b) parameters
create table #c(cxn int, czn int, cv float)
create table #b(bn int, bv float)
--@p is the number of independent variables including the intercept (@p = 0)
declare @p int
set @p = 1
--Loop through each independent variable and estimate the orthagonal parameter (#c)
-- then estimate the residuals and insert into the residuals table (#z)
while @p <= (select max(xn) from #x)
begin
insert into #c
select xn cxn, zn czn, sum(xv*zv)/sum(zv*zv) cv
from #x join #z on xid = zid where zn = @p-1 and xn>zn group by xn, zn
insert into #z
select zid, xn,xv- sum(cv*zv)
from #x join #z on xid = zid join #c on czn = zn and cxn = xn where xn = @p and zn<xn group by zid, xn,xv
set @p = @p +1
end
--Loop through each independent variable and estimate the regression parameter by regressing the orthoganal
-- resiuduals on the dependent variable y
while @p>=0
begin
insert into #b
select zn, sum(yv*zv)/ sum(zv*zv)
from #z join
(select yid, yv-isnull(sum(bv*xv),0) yv from #x join #y on xid = yid left join #b on xn=bn group by yid, yv) y
on zid = yid where zn = @p group by zn
set @p = @p-1
end
--The regression parameters
select * from #b
--Actual vs. fit with error
select yid, yv, fit, yv-fit err from #y join
(select xid, sum(xv*bv) fit from #x join #b on xn = bn group by xid) f
on yid = xid
--R Squared
select 1-sum(power(err,2))/sum(power(yv,2)) from
(select yid, yv, fit, yv-fit err from #y join
(select xid, sum(xv*bv) fit from #x join #b on xn = bn group by xid) f
on yid = xid) d
SQL Server 中没有线性回归函数。但是要计算数据点对 x,y 之间的简单线性回归 (Y' = bX + A) - 包括计算相关系数、确定系数 (R^2) 和标准误差估计(标准偏差),请执行下列操作:
对于regression_data
具有数字列x
和的表y
:
declare @total_points int
declare @intercept DECIMAL(38, 10)
declare @slope DECIMAL(38, 10)
declare @r_squared DECIMAL(38, 10)
declare @standard_estimate_error DECIMAL(38, 10)
declare @correlation_coefficient DECIMAL(38, 10)
declare @average_x DECIMAL(38, 10)
declare @average_y DECIMAL(38, 10)
declare @sumX DECIMAL(38, 10)
declare @sumY DECIMAL(38, 10)
declare @sumXX DECIMAL(38, 10)
declare @sumYY DECIMAL(38, 10)
declare @sumXY DECIMAL(38, 10)
declare @Sxx DECIMAL(38, 10)
declare @Syy DECIMAL(38, 10)
declare @Sxy DECIMAL(38, 10)
Select
@total_points = count(*),
@average_x = avg(x),
@average_y = avg(y),
@sumX = sum(x),
@sumY = sum(y),
@sumXX = sum(x*x),
@sumYY = sum(y*y),
@sumXY = sum(x*y)
from regression_data
set @Sxx = @sumXX - (@sumX * @sumX) / @total_points
set @Syy = @sumYY - (@sumY * @sumY) / @total_points
set @Sxy = @sumXY - (@sumX * @sumY) / @total_points
set @correlation_coefficient = @Sxy / SQRT(@Sxx * @Syy)
set @slope = (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2))
set @intercept = @average_y - (@total_points * @sumXY - @sumX * @sumY) / (@total_points * @sumXX - power(@sumX,2)) * @average_x
set @r_squared = (@intercept * @sumY + @slope * @sumXY - power(@sumY,2) / @total_points) / (@sumYY - power(@sumY,2) / @total_points)
-- calculate standard_estimate_error (standard deviation)
Select
@standard_estimate_error = sqrt(sum(power(y - (@slope * x + @intercept),2)) / @total_points)
From regression_data
这里它是一个函数,它接受一个表类型:table (Y float, X double),它被称为 XYDoubleType 并假设我们的线性函数是 AX + B 的形式。它返回 A 和 B 一个表列以防万一你想把它放在一个连接或其他东西中
CREATE FUNCTION FN_GetABForData(
@XYData as XYDoubleType READONLY
) RETURNS @ABData TABLE(
A FLOAT,
B FLOAT,
Rsquare FLOAT )
AS
BEGIN
DECLARE @sx FLOAT, @sy FLOAT
DECLARE @sxx FLOAT,@syy FLOAT, @sxy FLOAT,@sxsy FLOAT, @sxsx FLOAT, @sysy FLOAT
DECLARE @n FLOAT, @A FLOAT, @B FLOAT, @Rsq FLOAT
SELECT @sx =SUM(D.X) ,@sy =SUM(D.Y), @sxx=SUM(D.X*D.X),@syy=SUM(D.Y*D.Y),
@sxy =SUM(D.X*D.Y),@n =COUNT(*)
From @XYData D
SET @sxsx =@sx*@sx
SET @sxsy =@sx*@sy
SET @sysy = @sy*@sy
SET @A = (@n*@sxy -@sxsy)/(@n*@sxx -@sxsx)
SET @B = @sy/@n - @A*@sx/@n
SET @Rsq = POWER((@n*@sxy -@sxsy),2)/((@n*@sxx-@sxsx)*(@n*@syy -@sysy))
INSERT INTO @ABData (A,B,Rsquare) VALUES(@A,@B,@Rsq)
RETURN
END
要添加到@icc97 答案,我已经包含了斜率和截距的加权版本。如果这些值都是恒定的,则斜率将为 NULL(使用适当的设置SET ARITHABORT OFF; SET ANSI_WARNINGS OFF;
)并且需要通过 coalesce() 替换为 0。
这是一个用 SQL 编写的解决方案:
with d as (select segment,w,x,y from somedatasource)
select segment,
avg(y) - avg(x) *
((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
((count(*) * sum(x*x)) - (Sum(x)*Sum(x))) as intercept,
((count(*) * sum(x*y)) - (sum(x)*sum(y)))/
((count(*) * sum(x*x)) - (sum(x)*sum(x))) AS slope,
avg(y) - ((avg(x*y) - avg(x)*avg(y))/var_samp(X)) * avg(x) as interceptUnstable,
(avg(x*y) - avg(x)*avg(y))/var_samp(X) as slopeUnstable,
(Avg(x * y) - Avg(x) * Avg(y)) / (stddev_pop(x) * stddev_pop(y)) as correlationUnstable,
(sum(y*w)/sum(w)) - (sum(w*x)/sum(w)) *
((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w))) as wIntercept,
((sum(w)*sum(x*y*w)) - (sum(x*w)*sum(y*w)))/
((sum(w)*sum(x*x*w)) - (sum(x*w)*sum(x*w))) as wSlope,
(count(*) * sum(x * y) - sum(x) * sum(y)) / (sqrt(count(*) * sum(x * x) - sum(x) * sum(x))
* sqrt(count(*) * sum(y * y) - sum(y) * sum(y))) as correlation,
(sum(w) * sum(x*y*w) - sum(x*w) * sum(y*w)) /
(sqrt(sum(w) * sum(x*x*w) - sum(x*w) * sum(x*w)) * sqrt(sum(w) * sum(y*y*w)
- sum(y*w) * sum(y*w))) as wCorrelation,
count(*) as n
from d where x is not null and y is not null group by segment
其中 w 是权重。我对 R 进行了双重检查以确认结果。可能需要将数据从某个数据源转换为浮点数。我包含了不稳定的版本以警告您不要使用这些版本。(特别感谢 Stephan 在另一个答案中。)
更新:添加加权相关
我已经翻译了 Excel 中预测函数中使用的线性回归函数,并创建了一个返回 a、b 和预测的 SQL 函数。您可以在 FORECAST 功能的 excel 帮助中看到完整的理论解释。首先,您需要创建表数据类型 XYFloatType:
CREATE TYPE [dbo].[XYFloatType]
AS TABLE(
[X] FLOAT,
[Y] FLOAT)
然后编写如下函数:
/*
-- =============================================
-- Author: Me :)
-- Create date: Today :)
-- Description: (Copied Excel help):
--Calculates, or predicts, a future value by using existing values.
The predicted value is a y-value for a given x-value.
The known values are existing x-values and y-values, and the new value is predicted by using linear regression.
You can use this function to predict future sales, inventory requirements, or consumer trends.
-- =============================================
*/
CREATE FUNCTION dbo.FN_GetLinearRegressionForcast
(@PtXYData as XYFloatType READONLY ,@PnFuturePointint)
RETURNS @ABDData TABLE( a FLOAT, b FLOAT, Forecast FLOAT)
AS
BEGIN
DECLARE @LnAvX Float
,@LnAvY Float
,@LnB Float
,@LnA Float
,@LnForeCast Float
Select @LnAvX = AVG([X])
,@LnAvY = AVG([Y])
FROM @PtXYData;
SELECT @LnB = SUM ( ([X]-@LnAvX)*([Y]-@LnAvY) ) / SUM (POWER([X]-@LnAvX,2))
FROM @PtXYData;
SET @LnA = @LnAvY - @LnB * @LnAvX;
SET @LnForeCast = @LnA + @LnB * @PnFuturePoint;
INSERT INTO @ABDData ([A],[B],[Forecast]) VALUES (@LnA,@LnB,@LnForeCast)
RETURN
END
/*
your tests:
(I used the same values that are in the excel help)
DECLARE @t XYFloatType
INSERT @t VALUES(20,6),(28,7),(31,9),(38,15),(40,21) -- x and y values
SELECT *, A+B*30 [Prueba]FROM dbo.FN_GetLinearRegressionForcast@t,30);
*/
我希望以下答案可以帮助人们了解一些解决方案的来源。我将用一个简单的例子来说明它,但是只要你知道如何使用索引符号或矩阵,对许多变量的概括在理论上是很简单的。为了实现超过 3 个变量的解决方案,您将使用 Gram-Schmidt(请参阅上面的 Colin Campbell 的回答)或其他矩阵求逆算法。
由于我们需要的所有函数都是方差、协方差、平均、求和等,都是 SQL 中的聚合函数,因此可以轻松实现解决方案。我在 HIVE 中这样做是为了对 Logistic 模型的分数进行线性校准 - 在许多优点中,一个是您可以完全在 HIVE 中运行,而无需从某种脚本语言进出。
您的数据点由 i 索引的数据模型 (x_1, x_2, y) 是
y(x_1, x_2) = m_1*x_1 + m_2*x_2 + c
模型看起来“线性”,但不一定是,例如 x_2 可以是 x_1 的任何非线性函数,只要其中没有自由参数,例如 x_2 = Sinh(3*(x_1)^2 + 42)。即使 x_2 “只是” x_2 并且模型是线性的,回归问题也不是。只有当您决定问题是找到参数 m_1、m_2、c 以使它们最小化 L2 误差时,您才会遇到线性回归问题。
L2 误差为 sum_i( (y[i] - f(x_1[i], x_2[i]))^2 )。将这 3 个参数最小化(设置每个参数的偏导数 = 0)为 3 个未知数生成 3 个线性方程。这些方程在参数中是线性的(这就是使它成为线性回归的原因)并且可以解析求解。对一个简单的模型(1 个变量,线性模型,因此有两个参数)执行此操作是直接且有启发性的。在误差向量空间上对非欧几里得度量范数的推广很简单,对角线的特殊情况相当于使用“权重”。
回到我们的模型中的两个变量:
y = m_1*x_1 + m_2*x_2 + c
取期望值 =>
= m_1* + m_2* + c (0)
现在取 x_1 和 x_2 的协方差,并使用 cov(x,x) = var(x):
cov(y, x_1) = m_1*var(x_1) + m_2*covar(x_2, x_1) (1)
cov(y, x_2) = m_1*covar(x_1, x_2) + m_2*var(x_2) (2)
这是两个未知数中的两个方程,您可以通过反转 2X2 矩阵来求解。
以矩阵形式: ... 可以反转以产生 ... 其中
det = var(x_1)*var(x_2) - covar(x_1, x_2)^2
(哦,巴夫,“声望点”到底是什么?如果你想看看方程式,给我一些。)
在任何情况下,既然你有封闭形式的 m1 和 m2,你可以解决 c 的 (0)。
我检查了上面对 Excel 求解器的解析解,以获得具有高斯噪声的二次方,并且残差符合 6 位有效数字。
如果您想在大约 20 行的 SQL 中进行离散傅立叶变换,请与我联系。