If you want to find the values that lie within two standard deviations of the mean, you can use analytic functions (and avoid a self-join):
    SELECT dt,
           dept_id,
           sal,
           CASE
             WHEN sal BETWEEN avg_sal - 2 * stddev_sal
                          AND avg_sal + 2 * stddev_sal
             THEN 'N'
             ELSE 'Y'
           END AS out_of_normal_values
    FROM   (
      SELECT t.*,
             AVG( sal ) OVER () AS avg_sal,
             STDDEV( sal ) OVER () AS stddev_sal
      FROM   table_name t
    );
Which, for your sample data:
    CREATE TABLE table_name ( dt, dept_id, sal ) AS
      SELECT 201907, 10,     250 FROM DUAL UNION ALL
      SELECT 201907, 10,     290 FROM DUAL UNION ALL
      SELECT 201907, 10,     320 FROM DUAL UNION ALL
      SELECT 201907, 10,  100000 FROM DUAL UNION ALL
      SELECT 201907, 10,  500000 FROM DUAL UNION ALL
      SELECT 201908, 20,     800 FROM DUAL UNION ALL
      SELECT 201908, 20,     860 FROM DUAL UNION ALL
      SELECT 201908, 20,     700 FROM DUAL UNION ALL
      SELECT 201908, 20,  850000 FROM DUAL UNION ALL
      SELECT 201908, 20, 1000000 FROM DUAL UNION ALL
      SELECT 201909, 10,     260 FROM DUAL UNION ALL
      SELECT 201909, 10,     230 FROM DUAL UNION ALL
      SELECT 201909, 10,     310 FROM DUAL;
Outputs:
    DT | DEPT_ID |     SAL | OUT_OF_NORMAL_VALUES
-----: | ------: | ------: | :-------------------
201907 |      10 |     250 | N
201907 |      10 |     290 | N
201907 |      10 |     320 | N
201907 |      10 |  100000 | N
201907 |      10 |  500000 | N
201908 |      20 |     800 | N
201908 |      20 |     860 | N
201908 |      20 |     700 | N
201908 |      20 |  850000 | N
201908 |      20 | 1000000 | Y
201909 |      10 |     260 | N
201909 |      10 |     230 | N
201909 |      10 |     310 | N
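This flags only the single most extreme value as an outlier. With a larger data set the approach may work better, because you would have a much larger proportion of "normal" values relative to "outliers". However, because you have a small data set with two widely separated clusters of values, the mean lies between the clusters and the standard deviation is huge.
If you know that the outliers will always be high, you can use the median instead of the mean and treat as normal any value that lies between the minimum and the median, or that exceeds the median by no more than that same distance: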
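To see why so little is flagged, it can help to look at the band the first query is actually testing against. The following is a minimal sketch against the same `table_name`; the aliases `lower_bound` and `upper_bound` are names introduced here for illustration only. Oracle's `STDDEV` is the sample standard deviation.

```sql
-- Sketch: the mean, the sample standard deviation and the resulting
-- two-standard-deviation band over all rows of table_name.
-- lower_bound / upper_bound are illustrative aliases, not part of the original query.
SELECT AVG( sal )                     AS avg_sal,
       STDDEV( sal )                  AS stddev_sal,
       AVG( sal ) - 2 * STDDEV( sal ) AS lower_bound,
       AVG( sal ) + 2 * STDDEV( sal ) AS upper_bound
FROM   table_name;
```

For the sample data the mean is roughly 188,771 and the standard deviation roughly 355,815, so the upper bound is about 900,400 and the lower bound is negative. Only 1000000 falls outside that band, while 850000 still counts as normal.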
    SELECT dt,
           dept_id,
           sal,
           CASE
             WHEN sal BETWEEN min_sal
                          AND median_sal + ( median_sal - min_sal )
             THEN 'N'
             ELSE 'Y'
           END AS out_of_normal_values
    FROM   (
      SELECT t.*,
             MEDIAN( sal ) OVER () AS median_sal,
             MIN( sal ) OVER () AS min_sal
      FROM   table_name t
    );
Which outputs:
    DT | DEPT_ID |     SAL | OUT_OF_NORMAL_VALUES
-----: | ------: | ------: | :-------------------
201909 |      10 |     230 | N
201907 |      10 |     250 | N
201909 |      10 |     260 | N
201907 |      10 |     290 | N
201909 |      10 |     310 | N
201907 |      10 |     320 | N
201908 |      20 |     700 | N
201908 |      20 |     800 | N
201908 |      20 |     860 | N
201907 |      10 |  100000 | Y
201907 |      10 |  500000 | Y
201908 |      20 |  850000 | Y
201908 |      20 | 1000000 | Y
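The cut-off used by the median-based query can be inspected the same way. This is a small sketch on the same `table_name`; the alias `upper_bound` is an illustrative name, not something from the query above.

```sql
-- Sketch: the median, the minimum and the upper cut-off
-- median + (median - min) used to separate normal values from high outliers.
-- upper_bound is an illustrative alias.
SELECT MEDIAN( sal )                                AS median_sal,
       MIN( sal )                                   AS min_sal,
       MEDIAN( sal ) + ( MEDIAN( sal ) - MIN( sal ) ) AS upper_bound
FROM   table_name;
```

With the sample data the median is 700 and the minimum is 230, so the cut-off is 700 + (700 - 230) = 1170, which is why every value from 100000 upwards is marked 'Y'.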
db<>fiddle here