I am using a SQL query to determine the z-score ((x - μ) / σ) of several columns.

In particular, I have a table like the following:

my_table
id    col_a  col_b  col_c
1     3      6      5
2     5      3      3
3     2      2      9
4     9      8      2

...and I would like to select the z-score of every number in every row, based on the mean and standard deviation of its column.

So the result would look like this:

id    col_d     col_e     col_f
1    -0.4343    1.0203    ...
2     0.1434   -0.8729
3    -0.8234   -1.2323
4     1.889     1.5343

At the moment my code calculates the scores for two columns, as follows:

select id,
   (my_table.col_a - avg(mya.col_a)) / stddev(mya.col_a) as col_d,
   (my_table.col_b - avg(myb.col_b)) / stddev(myb.col_b) as col_e
from my_table,
(select col_a from my_table) mya,
(select col_b from my_table) myb
group by id;

However, this is very slow. I am still waiting on the three-column query to finish.

Is there a better way to accomplish this? I am using postgres, but an answer in any general language would help me. Thanks!


3 Answers

18

You can use window functions like this:

select
    t.id,
    -- PostgreSQL spells the function stddev (stdev is the SQL Server name)
    (t.col_a - avg(t.col_a) over()) / stddev(t.col_a) over() as col_d,
    (t.col_b - avg(t.col_b) over()) / stddev(t.col_b) over() as col_e
from my_table as t

Or cross join with the precomputed avg and stddev:

select
    t.id,
    (t.col_a - tt.col_a_avg) / tt.col_a_stdev as col_d,
    (t.col_b - tt.col_b_avg) / tt.col_b_stdev as col_e
from my_table as t
    cross join (
        select 
            avg(tt.col_a) as col_a_avg,
            avg(tt.col_b) as col_b_avg,
            stddev(tt.col_a) as col_a_stdev,
            stddev(tt.col_b) as col_b_stdev
        from my_table as tt
   ) as tt
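
To try the window-function approach end to end, including the third column the question asks about, here is a minimal sketch. It assumes PostgreSQL and a throwaway temporary table that mirrors the sample data in the question; the column types are an assumption, since the question does not state them.

```sql
-- Hypothetical setup mirroring the sample table in the question.
CREATE TEMP TABLE my_table (
    id    integer PRIMARY KEY,
    col_a numeric,   -- assumed type; the question does not specify it
    col_b numeric,
    col_c numeric
);

INSERT INTO my_table (id, col_a, col_b, col_c) VALUES
    (1, 3, 6, 5),
    (2, 5, 3, 3),
    (3, 2, 2, 9),
    (4, 9, 8, 2);

-- Each OVER () window spans every row, so the aggregates are computed
-- over the whole table without any self-join.
SELECT
    t.id,
    (t.col_a - avg(t.col_a) OVER ()) / stddev(t.col_a) OVER () AS col_d,
    (t.col_b - avg(t.col_b) OVER ()) / stddev(t.col_b) OVER () AS col_e,
    (t.col_c - avg(t.col_c) OVER ()) / stddev(t.col_c) OVER () AS col_f
FROM my_table AS t
ORDER BY t.id;
```

In PostgreSQL, stddev is an alias for stddev_samp (the sample standard deviation); stddev_pop can be substituted if the population form is wanted.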
answered 2013-10-09T18:17:39.260
0

Use a WITH clause:

WITH stats AS ( SELECT avg ( col_a ) a_avg, stddev ( col_a ) a_stddev,
                       avg ( col_b ) b_avg, stddev ( col_b ) b_stddev
                    FROM my_table 
              )
SELECT id, ( col_a - a_avg) / a_stddev col_d, 
           ( col_b - b_avg) / b_stddev col_e
    FROM my_table, stats

But I prefer Roman's window solution.

For Oğuz: handling NULL values in my_table:

WITH stats AS ( 
              SELECT avg ( col_a ) a_avg, stddev ( col_a ) as a_stddev,
                     avg ( col_b ) b_avg, stddev ( col_b ) as b_stddev
                  FROM my_table 
              )
SELECT id, 
       COALESCE ( ( col_a - a_avg) / a_stddev, NULL ) col_d, 
       COALESCE ( ( col_b - b_avg) / b_stddev, NULL ) col_e
FROM my_table, stats
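
A related edge case may be worth guarding against. avg and stddev already skip NULL inputs, and arithmetic on a NULL value yields NULL, so COALESCE ( expr, NULL ) returns the same result as expr on its own. What the query above does not cover is a column whose values are all identical: its standard deviation is 0 and PostgreSQL raises a division-by-zero error. The following is a sketch of one possible guard, assuming a NULL z-score is acceptable in that case:

```sql
WITH stats AS (
    SELECT avg(col_a) AS a_avg, stddev(col_a) AS a_stddev,
           avg(col_b) AS b_avg, stddev(col_b) AS b_stddev
    FROM my_table
)
SELECT id,
       -- NULLIF turns a zero standard deviation into NULL, so the
       -- division yields NULL instead of raising an error.
       (col_a - a_avg) / NULLIF(a_stddev, 0) AS col_d,
       (col_b - b_avg) / NULLIF(b_stddev, 0) AS col_e
FROM my_table
CROSS JOIN stats;
```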
answered 2017-05-27T13:47:20.213
-2

I would first select the avg() and stddev() values into a table variable, then use that table for the calculation.

So you would end up with a table variable containing the columns AVG_col_a, stddev_col_a, AVG_col_b, stddev_col_b ......

Something like this:

DECLARE @Table as table (AVG_col_a, stddev_col_a, AVG_col_b, stddev_col_b ......)
INSERT into @Table
SELECT AVG(col_A), stddev(col_a), .......
FROM myTable

SELECT (m.col_a-AVG_col_a)/stddev_col_a as col_d,
       (m.col_b-AVG_col_b)/stddev_col_b as col_e
 FROM myTable m, @Table
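
Table variables are SQL Server syntax. Since the question is about postgres, a rough equivalent of the same idea is a temporary table that holds the single row of precomputed statistics. This is only a sketch; the name col_stats is made up for illustration, and the table name my_table is taken from the question.

```sql
-- Assumption: PostgreSQL, where a temporary table stands in for the
-- table variable. col_stats is a hypothetical name.
CREATE TEMP TABLE col_stats AS
SELECT avg(col_a)    AS avg_col_a,
       stddev(col_a) AS stddev_col_a,
       avg(col_b)    AS avg_col_b,
       stddev(col_b) AS stddev_col_b
FROM my_table;

-- col_stats has exactly one row, so the cross join simply attaches
-- the statistics to every row of my_table.
SELECT m.id,
       (m.col_a - s.avg_col_a) / s.stddev_col_a AS col_d,
       (m.col_b - s.avg_col_b) / s.stddev_col_b AS col_e
FROM my_table AS m
CROSS JOIN col_stats AS s;
```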
answered 2013-10-09T18:09:34.097