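I have the sample data below and want to filter out / identify the normal values. I am not sure how to exclude outliers in SQL. I tried taking AVG(Sal), but I don't know how to keep the very high values out of the average.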

    date   dept_id  Sal
    201907   10     250
    201907   10     290
    201907   10     320
    201907   10     100000
    201907   10     500000
    201908   20     800
    201908   20     860
    201908   20     700
    201908   20     850000
    201908   20     1000000
    201909   10     260
    201909   10     230
    201909   10     310

The expected output is as follows:

    date   dept_id  Sal     out_of_normal_values
    201907   10     250         N
    201907   10     290         N
    201907   10     320         N
    201907   10     100000      Y   
    201907   10     500000      Y
    201908   20     800         N 
    201908   20     860         N
    201908   20     700         N
    201908   20     850000      Y
    201908   20     1000000     Y
    201909   10     260         N
    201909   10     230         N
    201909   10     310         N

3 Answers

You can write a CASE expression with a fixed threshold, as follows:

case
    when
        sal > 1000
    then
        'Y'
    else
        'N'
end as out_of_normal_values
answered 2020-04-15T04:40:14.210

If you want to find the values that lie within two standard deviations of the mean, you can use analytic functions (and avoid a self-join):

SELECT dt,
       dept_id,
       sal,
       CASE
       WHEN sal BETWEEN avg_sal - 2 * stddev_sal
                AND     avg_sal + 2 * stddev_sal
       THEN 'N'
       ELSE 'Y'
       END AS out_of_normal_values
FROM   (
  SELECT t.*,
         AVG( sal ) OVER () AS avg_sal,
         STDDEV( sal ) OVER () AS stddev_sal
  FROM   table_name t
);

Which, for your data:

CREATE TABLE  table_name ( dt, dept_id, Sal ) AS
SELECT 201907,   10,     250 FROM DUAL UNION ALL
SELECT 201907,   10,     290 FROM DUAL UNION ALL
SELECT 201907,   10,     320 FROM DUAL UNION ALL
SELECT 201907,   10,     100000 FROM DUAL UNION ALL
SELECT 201907,   10,     500000 FROM DUAL UNION ALL
SELECT 201908,   20,     800 FROM DUAL UNION ALL
SELECT 201908,   20,     860 FROM DUAL UNION ALL
SELECT 201908,   20,     700 FROM DUAL UNION ALL
SELECT 201908,   20,     850000 FROM DUAL UNION ALL
SELECT 201908,   20,     1000000 FROM DUAL UNION ALL
SELECT 201909,   10,     260 FROM DUAL UNION ALL
SELECT 201909,   10,     230 FROM DUAL UNION ALL
SELECT 201909,   10,     310 FROM DUAL;

Outputs:

    DT | DEPT_ID |     SAL | OUT_OF_NORMAL_VALUES
-----: | ------: | ------: | :-------------------
201907 |      10 |     250 | N
201907 |      10 |     290 | N
201907 |      10 |     320 | N
201907 |      10 |  100000 | N
201907 |      10 |  500000 | N
201908 |      20 |     800 | N
201908 |      20 |     860 | N
201908 |      20 |     700 | N
201908 |      20 |  850000 | N
201908 |      20 | 1000000 | Y
201909 |      10 |     260 | N
201909 |      10 |     230 | N
201909 |      10 |     310 | N

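This flags only the most extreme value. With a larger data set it may work better, because you would have a larger proportion of "normal" values relative to outliers. However, because you have a small data set with two widely separated clusters of values, the mean lies between them and the standard deviation is huge.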
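
To make the arithmetic behind that result concrete, here is a minimal Python sketch that recomputes the mean, the standard deviation, and the two-standard-deviation band for the thirteen Sal values in the question. It is an illustration only, not part of the SQL solution; it assumes Oracle's STDDEV is the sample standard deviation, which is what `statistics.stdev` computes.

```python
from statistics import mean, stdev

# Sal values copied from the question's sample data.
sal = [250, 290, 320, 100000, 500000,
       800, 860, 700, 850000, 1000000,
       260, 230, 310]

avg_sal = mean(sal)
# Assumption: sample standard deviation, matching Oracle's STDDEV.
stddev_sal = stdev(sal)

# The band used by the CASE expression: avg - 2*stddev .. avg + 2*stddev.
lower = avg_sal - 2 * stddev_sal
upper = avg_sal + 2 * stddev_sal

# Rows that fall outside the band would be marked 'Y'.
flagged = [s for s in sal if not (lower <= s <= upper)]

print(f"mean={avg_sal:.0f} stddev={stddev_sal:.0f}")
print(f"band=[{lower:.0f}, {upper:.0f}]")
print("flagged:", flagged)
```

The lower bound comes out negative and the upper bound lands between 850000 and 1000000, so only 1000000 is flagged.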

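If you know that the outliers will be high, you can use the median instead of the mean and treat as normal any value that lies between the minimum and the median, or within the same distance above the median: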

SELECT dt,
       dept_id,
       sal,
       CASE
       WHEN sal BETWEEN min_sal
                AND     median_sal + ( median_sal - min_sal )
       THEN 'N'
       ELSE 'Y'
       END AS out_of_normal_values
FROM   (
  SELECT t.*,
         MEDIAN( sal ) OVER () AS median_sal,
         MIN(sal) OVER () AS min_sal
  FROM   table_name t
)

Which outputs:

    DT | DEPT_ID |     SAL | OUT_OF_NORMAL_VALUES
-----: | ------: | ------: | :-------------------
201909 |      10 |     230 | N
201907 |      10 |     250 | N
201909 |      10 |     260 | N
201907 |      10 |     290 | N
201909 |      10 |     310 | N
201907 |      10 |     320 | N
201908 |      20 |     700 | N
201908 |      20 |     800 | N
201908 |      20 |     860 | N
201907 |      10 |  100000 | Y
201907 |      10 |  500000 | Y
201908 |      20 |  850000 | Y
201908 |      20 | 1000000 | Y
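
As a cross-check of the median-based rule, the following Python sketch applies the same bounds (minimum up to median plus the median-to-minimum distance) to the question's Sal values. It is only an illustration of the arithmetic, not a replacement for the SQL.

```python
from statistics import median

# Sal values copied from the question's sample data.
sal = [250, 290, 320, 100000, 500000,
       800, 860, 700, 850000, 1000000,
       260, 230, 310]

median_sal = median(sal)  # middle value of the 13 sorted values
min_sal = min(sal)

# Mirror the SQL: normal means min_sal <= sal <= median + (median - min).
upper = median_sal + (median_sal - min_sal)

# Rows outside that range would be marked 'Y'.
flagged = sorted(s for s in sal if not (min_sal <= s <= upper))

print(f"min={min_sal} median={median_sal} upper={upper}")
print("flagged:", flagged)
```

With a median of 700 and a minimum of 230 the upper bound is 1170, so the four large salaries are the only ones flagged.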

db<>fiddle here

answered 2020-04-15T08:55:21.133

You can use a join and GROUP BY to get the desired result:

select a.date, b.dept_id, a.sal,
    case
        when b.avg_sal < a.sal then 'Y'
        else 'N'
    end as out_of_normal
from tbl a join (
    select dept_id, avg(sal) avg_sal from tbl
    group by dept_id
) b on a.dept_id = b.dept_id
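
To see what the per-department average rule does on the sample data, here is a small Python sketch of the same logic: group by dept_id, compute each group's average, and flag rows above it. The variable names are illustrative only.

```python
from collections import defaultdict

# (date, dept_id, sal) rows copied from the question's sample data.
rows = [
    (201907, 10, 250), (201907, 10, 290), (201907, 10, 320),
    (201907, 10, 100000), (201907, 10, 500000),
    (201908, 20, 800), (201908, 20, 860), (201908, 20, 700),
    (201908, 20, 850000), (201908, 20, 1000000),
    (201909, 10, 260), (201909, 10, 230), (201909, 10, 310),
]

# Equivalent of the subquery: avg(sal) grouped by dept_id.
by_dept = defaultdict(list)
for _, dept_id, sal in rows:
    by_dept[dept_id].append(sal)
avg_sal = {dept_id: sum(v) / len(v) for dept_id, v in by_dept.items()}

# Equivalent of the join plus CASE: 'Y' when sal is above its department average.
flagged = [(dt, dept_id, sal, "Y" if sal > avg_sal[dept_id] else "N")
           for dt, dept_id, sal in rows]

for row in flagged:
    print(row)
```

On this data the four large salaries are the only rows above their department average. Note that the rule flags every above-average row, so a department without extreme values would still have some rows marked 'Y'.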
answered 2020-04-15T05:00:11.790