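Inspired by this notebook, I am experimenting with the IsolationForest algorithm for anomaly detection on the SF version of the KDDCUP99 dataset, which has 4 attributes. The data is fetched directly from sklearn and, after preprocessing (label-encoding the categorical feature), passed to the IF algorithm with default settings. I am using scikit-learn==0.22.2.post1.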

The full code is as follows:

from sklearn import datasets
from sklearn import preprocessing
from sklearn.model_selection import train_test_split

from sklearn.ensemble import IsolationForest
from sklearn.metrics import confusion_matrix
from sklearn.metrics import recall_score, roc_curve, roc_auc_score, f1_score, precision_recall_curve, auc
from sklearn.metrics import make_scorer
from sklearn.metrics import accuracy_score

import pandas as pd
import numpy as np
import seaborn as sns
import itertools
import matplotlib.pyplot as plt
import datetime

%matplotlib inline


def byte_decoder(val):
    # decodes byte literals to strings
    
    return val.decode('utf-8')

#Load Dataset KDDCUP99 from sklearn
target = 'target'
sf = datasets.fetch_kddcup99(subset='SF', percent10=False) # you can use percent10=True for convenience sake
dfSF=pd.DataFrame(sf.data, 
                  columns=["duration", "service", "src_bytes", "dst_bytes"])
assert len(dfSF)>0, "SF dataset not loaded."

dfSF[target]=sf.target
anomaly_rateSF = 1.0 - len(dfSF.loc[dfSF[target]==b'normal.'])/len(dfSF)

print("SF Anomaly Rate is: {:.1%}".format(anomaly_rateSF))
#SF Anomaly Rate is: 0.5%

#Data Processing 
toDecodeSF = ['service']
# apply label encoding (not one-hot encoding) to fields of type string
# convert all abnormal target types to a single anomaly class (-1); normal is 1

dfSF['binary_target'] = [1 if x==b'normal.' else -1 for x in dfSF[target]]
    
leSF = preprocessing.LabelEncoder()

for f in toDecodeSF:
    # decode the byte strings first, then label-encode the decoded values
    decoded = dfSF[f].map(byte_decoder)
    dfSF[f + " (encoded)"] = leSF.fit_transform(decoded)
    # the raw column is no longer needed once it has been encoded
    dfSF.drop(f, axis=1, inplace=True)

dfSF.drop(target, axis=1, inplace=True)

#check rate of Anomaly for setting contamination parameter in IF
dfSF["binary_target"].value_counts(normalize=True)



#data split
X_train_sf, X_test_sf, y_train_sf, y_test_sf = train_test_split(dfSF.drop('binary_target', axis=1), 
                                                                dfSF['binary_target'], 
                                                                test_size=0.33,
                                                                random_state=11,
                                                                stratify=dfSF['binary_target'])

#print(y_test_sf.value_counts())
#1       230899
#-1      1114
#Name: binary_target, dtype: int64

#training IF and predict the outliers/anomalies on test set with 10% contamination:
clfIF = IsolationForest(max_samples="auto",
                        random_state=11,
                        contamination = 0.1,
                        n_estimators=100,
                        n_jobs=-1)

# IsolationForest is unsupervised: y_train_sf is accepted for API consistency but ignored
clfIF.fit(X_train_sf, y_train_sf)
y_pred_test = clfIF.predict(X_test_sf)

#print(X_test_sf.shape)
#(232013, 4)

#print(np.unique(y_pred_test, return_counts=True))
#(array([-1,  1]), array([ 23248, 208765]))
# 10% of 232013 would be 23201 outliers/anomalies, but 23248 are flagged

According to the documentation, in the binary case we can extract the true positives etc. as follows:

tn, fp, fn, tp = confusion_matrix(y_test_sf, y_pred_test).ravel()
print("TN: ",tn,"FP: ", fp,"FN: " ,fn,"TP: ", tp)
#TN:  1089 FP:  25 FN:  22159 TP:  208740
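
For reference, `confusion_matrix` sorts the class labels unless `labels` is passed, so with the labels -1 and 1 the anomaly class (-1) ends up as the "negative" class and normal (1) as the "positive" one. The small sketch below uses made-up toy labels (not the KDDCUP99 data) to show how the four counts shift when `labels=[1, -1]` is passed so that the anomaly class is treated as positive.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels (illustrative only): 1 = normal, -1 = anomaly.
y_true = np.array([1, 1, 1, 1, 1, 1, -1, -1])
y_pred = np.array([1, 1, 1, 1, -1, -1, -1, 1])

# Default: classes are sorted as [-1, 1], so -1 is "negative" and 1 is "positive".
# Here TN + FP equals the number of true anomalies, FN + TP the number of normals.
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
default_counts = (tn, fp, fn, tp)

# With labels=[1, -1] the anomaly class (-1) becomes the "positive" class,
# so TN + FP equals the number of normals and FN + TP the number of anomalies.
tn_a, fp_a, fn_a, tp_a = confusion_matrix(y_true, y_pred, labels=[1, -1]).ravel()
anomaly_positive_counts = (tn_a, fp_a, fn_a, tp_a)

print("default order   :", default_counts)
print("anomaly positive:", anomaly_positive_counts)
```

Under the default ordering, a sum such as TN + FP counts the true anomalies rather than the normal points, which is one possible reading of the numbers printed above.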

Questions

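  • Question 1: Why does IF flag more than the 10% contamination I set when it labels outliers/anomalies on the test set? 23248 instead of 23201!
  • Question 2: Normally TN+FP should equal the number of inliers/normal points, 230899, and FN+TP should equal 1114, since we count after the data split. I think it is the other way around in my implementation, but I cannot figure out why or debug it.
  • Question 3: Based on the KDDCUP99 dataset documentation and its user guide, and on my calculation in this implementation, the anomaly rate is 0.5%, which means that if I set contamination=0.005, it should give me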

[image]

I am probably missing something here; any help would be greatly appreciated.

1 Answer

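The contamination parameter only controls the threshold of the decision function, that is, the score below which a data point is considered an outlier. It has no effect on the model itself. It may make sense to use some statistical analysis to get a rough estimate of the contamination.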
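
As a rough illustration of this point: in scikit-learn the threshold is stored in the fitted `offset_` attribute and is derived from the scores of the training data, so the share of points flagged on a different dataset need not match `contamination` exactly. The sketch below uses synthetic Gaussian data (an assumption for illustration, not the KDDCUP99 set) to compare the flagged share on training and unseen data.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (illustrative only): train and test drawn from the same distribution.
rng = np.random.RandomState(0)
X_train = rng.normal(size=(2000, 2))
X_test = rng.normal(size=(1000, 2))

contamination = 0.1
clf = IsolationForest(contamination=contamination, random_state=0).fit(X_train)

# The fitted threshold is the contamination-percentile of the *training* scores.
train_scores = clf.score_samples(X_train)
threshold = np.percentile(train_scores, 100 * contamination)

# Share of points flagged as outliers (-1) on each set.
train_rate = np.mean(clf.predict(X_train) == -1)
test_rate = np.mean(clf.predict(X_test) == -1)

print("offset_ vs. percentile of training scores:", clf.offset_, threshold)
print("flagged share on train:", train_rate)
print("flagged share on test :", test_rate)
```

The training share stays close to the requested 10%, while the test share is whatever fraction of the new scores happens to fall below the same threshold, which would explain a count such as 23248 instead of 23201.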

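If you expect a certain number of outliers in your dataset, you can use the raw scores to find a threshold that yields that number, and retroactively set the contamination parameter when applying the model to new data.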
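
A minimal sketch of that idea, assuming synthetic data and a hypothetical expected count of 10 outliers: the raw scores come from `score_samples` (lower means more abnormal), the threshold is the score of the k-th lowest point, and the helper `predict_with_threshold` is an illustrative name, not part of scikit-learn.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic data (illustrative only): 990 inliers plus 10 far-away points.
rng = np.random.RandomState(42)
inliers = rng.normal(0, 1, size=(990, 2))
outliers = rng.uniform(6, 8, size=(10, 2))
X = np.vstack([inliers, outliers])

clf = IsolationForest(random_state=42).fit(X)
scores = clf.score_samples(X)  # raw anomaly scores; lower = more abnormal

expected_outliers = 10  # assumption: domain knowledge says roughly 10 outliers

# Threshold = score of the k-th most abnormal point.
threshold = np.sort(scores)[expected_outliers - 1]
flagged = scores <= threshold

# The contamination value implied by this threshold on the fitting data.
implied_contamination = flagged.mean()


def predict_with_threshold(model, X_new, thr):
    """Hypothetical helper: label points -1 (outlier) or 1 (inlier) using a custom threshold."""
    return np.where(model.score_samples(X_new) <= thr, -1, 1)


preds = predict_with_threshold(clf, X, threshold)
print("flagged:", int(flagged.sum()), "implied contamination:", implied_contamination)
```

The implied contamination could then be passed as `contamination` when refitting, or the custom threshold could simply be applied to the scores of new data as the helper does.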

Answered 2021-03-23T20:46:37.907