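Inspired by this notebook, I'm experimenting with the IsolationForest algorithm from scikit-learn==0.22.2.post1 for anomaly detection on the SF version of the KDDCUP99 dataset, which includes 4 attributes. The data is fetched directly from sklearn and, after preprocessing (label encoding the categorical feature), passed to the IF algorithm with the default setup.
The full code is as follows: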
from sklearn import datasets
from sklearn import preprocessing
from sklearn.model_selection import train_test_split
from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, roc_curve, roc_auc_score, f1_score, precision_recall_curve, auc
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score
import pandas as pd
import numpy as np
import seaborn as sns
import itertools
import matplotlib.pyplot as plt
import datetime
%matplotlib inline
def byte_decoder(val):
    # decodes byte literals to strings
    return val.decode('utf-8')
#Load Dataset KDDCUP99 from sklearn
target = 'target'
sf = datasets.fetch_kddcup99(subset='SF', percent10=False) # you can use percent10=True for convenience sake
dfSF = pd.DataFrame(sf.data,
                    columns=["duration", "service", "src_bytes", "dst_bytes"])
assert len(dfSF) > 0, "SF dataset not loaded."
dfSF[target]=sf.target
anomaly_rateSF = 1.0 - len(dfSF.loc[dfSF[target]==b'normal.'])/len(dfSF)
"SF Anomaly Rate is:"+"{:.1%}".format(anomaly_rateSF)
#'SF Anomaly Rate is: 0.5%'
#Data Processing
toDecodeSF = ['service']
# convert all abnormal target types to a single anomaly class
dfSF['binary_target'] = [1 if x == b'normal.' else -1 for x in dfSF[target]]
# apply label encoding to fields of type string
leSF = preprocessing.LabelEncoder()
for f in toDecodeSF:
    # decode the byte values first, then label-encode the decoded strings
    dfSF[f + " (encoded)"] = list(map(byte_decoder, dfSF[f]))
    dfSF[f + " (encoded)"] = leSF.fit_transform(dfSF[f + " (encoded)"])
# drop the original string columns and the multi-class target
for f in toDecodeSF:
    dfSF.drop(f, axis=1, inplace=True)
dfSF.drop(target, axis=1, inplace=True)
#check rate of Anomaly for setting contamination parameter in IF
dfSF["binary_target"].value_counts() / np.sum(dfSF["binary_target"].value_counts())
#data split
X_train_sf, X_test_sf, y_train_sf, y_test_sf = train_test_split(
    dfSF.drop('binary_target', axis=1),
    dfSF['binary_target'],
    test_size=0.33,
    random_state=11,
    stratify=dfSF['binary_target'])
#print(y_test_sf.value_counts())
#1 230899
#-1 1114
#Name: binary_target, dtype: int64
#training IF and predict the outliers/anomalies on test set with 10% contamination:
clfIF = IsolationForest(max_samples="auto",
                        random_state=11,
                        contamination=0.1,
                        n_estimators=100,
                        n_jobs=-1)
clfIF.fit(X_train_sf, y_train_sf)
y_pred_test = clfIF.predict(X_test_sf)
#print(X_test_sf.shape)
#(232013, 4)
#print(np.unique(y_pred_test, return_counts=True))
#(array([-1, 1]), array([ 23248, 208765])) # instead of labeling 10% of 232013, which is 23201 data outliers/anomalies, It is 23248 !!
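To check whether the same kind of gap shows up outside this dataset, here is a minimal sketch that compares the fraction flagged on the training split with the fraction flagged on the held-out split. It uses synthetic Gaussian data as a stand-in for KDDCUP99, and the names `X`, `clf`, `train_frac` and `test_frac` are made up for illustration. The scikit-learn documentation describes the threshold (`offset_`) as being chosen so that the expected number of outliers is obtained in training, so looking at both splits side by side may be informative.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical, NOT KDDCUP99): 3000 rows, 4 features.
rng = np.random.RandomState(11)
X = rng.normal(size=(3000, 4))
X_train, X_test = train_test_split(X, test_size=0.33, random_state=11)

# Same contamination value as in the question.
clf = IsolationForest(contamination=0.1, n_estimators=100, random_state=11)
clf.fit(X_train)

# Fraction of samples predicted as outliers (-1) on each split.
train_frac = np.mean(clf.predict(X_train) == -1)
test_frac = np.mean(clf.predict(X_test) == -1)
print("flagged on train:", train_frac)
print("flagged on test: ", test_frac)
```

Running the same two lines with `X_train_sf` and `X_test_sf` in place of the synthetic arrays would show where the 10% figure applies in the real experiment.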
According to the documentation, in the binary case we can extract true positives, etc. as follows:
tn, fp, fn, tp = confusion_matrix(y_test_sf, y_pred_test).ravel()
print("TN: ",tn,"FP: ", fp,"FN: " ,fn,"TP: ", tp)
#TN: 1089 FP: 25 FN: 22159 TP: 208740
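To see how `ravel()` orders the four counts when the labels are -1 and 1, a toy example may help. The arrays below are made up for illustration and are not taken from the dataset. The second call passes `labels=[1, -1]`, which is one possible way to make the anomaly class the "positive" one; whether that is the desired convention is an assumption on my part.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (hypothetical): 1 = normal, -1 = anomaly, same convention as above.
y_true = np.array([1, 1, 1, 1, -1, -1])
y_pred = np.array([1, 1, 1, -1, -1, 1])

# By default the labels are sorted, so rows and columns are ordered [-1, 1].
# The "negative" class is therefore -1 and the "positive" class is 1.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print("default order:   TN", tn, "FP", fp, "FN", fn, "TP", tp)

# Passing labels=[1, -1] puts the anomaly class last, i.e. treats it as positive.
tn_a, fp_a, fn_a, tp_a = confusion_matrix(y_true, y_pred, labels=[1, -1]).ravel()
print("anomaly-as-pos:  TN", tn_a, "FP", fp_a, "FN", fn_a, "TP", tp_a)
```

In the default order, `tn + fp` counts the samples whose true label is -1, and `fn + tp` counts the samples whose true label is 1.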
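Questions:
- Question 1: Why does IF flag more than the 10% contamination I set when it labels outliers/anomalies on the test set? It flags 23248 instead of 23201!
- Question 2: Normally TN+FP should be the inliers/normal count, 230899, and FN+TP should equal 1114, since we count after the data split. I think it is the other way around in my implementation, but I can't figure out why or debug it.
- Question 3: Based on the KDDCUP99 dataset documentation, its User Guide, and my calculation in this implementation, the anomaly rate is 0.5%, which means that if I set contamination=0.005, it should give me ...
I'm probably missing something here; any help would be appreciated.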