The dataset is as follows:
,store id,revenue ,profit
0,101,779183,281257
1,101,144829,838451
2,101,766465,757565
3,101,353297,261071
4,101,1615461,275760
5,101,246731,949229
6,101,951518,301016
7,101,444669,430583
The code is as follows:
import pandas as pd
import numpy as np
import pylab
from sklearn.preprocessing import StandardScaler
from pylab import rcParams
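# header=0: the first row holds the column names (with header=None the columns are integers and .str fails)
df = pd.read_csv(r'data.csv', header=0, sep=',')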
df.columns = df.columns.str.replace(' ', '')
dummies = pd.get_dummies(data = df)
del dummies['Unnamed:0']
store = dummies[['storeid']]
test = dummies[['profit']]
param = 'profit'  # column under test; matches the Series name in the output
qv1 = test[param].quantile(0.25)
qv2 = test[param].quantile(0.5)
qv3 = test[param].quantile(0.75)
qv_limit = 1.5 * (qv3 - qv1)
qv_limit,qv3,qv1
#(688855.5, 776026.0, 316789.0)
un_outliers_mask = (test[param] > qv3 + qv_limit) | (test[param] < qv1 - qv_limit)
un_outliers_data = test[param][un_outliers_mask]
un_outliers_name = store[un_outliers_mask]
un_outliers_data
The output of un_outliers_data is
Series([], Name: profit, dtype: int64)
But some points are outliers: as you can see, 1615461 > (776026.0 + 688855.5).
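
For reference, here is a minimal sketch of the same 1.5 * IQR check applied to each column separately. It assumes only the eight sample rows shown above, so its quartiles will differ from the ones printed from the full data.csv. `iqr_outliers` is a hypothetical helper name, not something from the original code. In the excerpt, 1615461 appears in the revenue column, so a mask built on profit would not be expected to flag it.

```python
import io
import pandas as pd

# The eight sample rows from the question (not the full data.csv).
SAMPLE = """,store id,revenue ,profit
0,101,779183,281257
1,101,144829,838451
2,101,766465,757565
3,101,353297,261071
4,101,1615461,275760
5,101,246731,949229
6,101,951518,301016
7,101,444669,430583
"""


def iqr_outliers(series):
    """Return the values of `series` outside the 1.5 * IQR fences."""
    q1 = series.quantile(0.25)
    q3 = series.quantile(0.75)
    limit = 1.5 * (q3 - q1)
    mask = (series > q3 + limit) | (series < q1 - limit)
    return series[mask]


# The unnamed first column is the row index; strip spaces from the names.
df = pd.read_csv(io.StringIO(SAMPLE), header=0, index_col=0)
df.columns = df.columns.str.replace(' ', '')

# Run the same check on each column on its own.
profit_hits = iqr_outliers(df['profit'])
revenue_hits = iqr_outliers(df['revenue'])

print('profit outliers: ', profit_hits.tolist())
print('revenue outliers:', revenue_hits.tolist())
```

On these sample rows the profit check comes back empty, while the revenue check flags 1615461. That suggests checking which column `param` points at before comparing against a value taken from another column.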