python - Finding outliers with scikit-learn's Isolation Forest (IF)

Asked 2018-04-19T14:38:12.570

Viewed 958 times

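I am trying to detect outliers in my dataset, which has 5000 observations and 800 features. I followed the simple steps in http://scikit-learn.org/stable/auto_examples/ensemble/plot_isolation_forest.html

There are also some examples on Stack Overflow and in other sources, but I could not find a decent explanation of how to interpret the outliers that Isolation Forest returns. First, here is what I did: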

from sklearn.ensemble import IsolationForest

X_train = trbb[check_cols]

clf = IsolationForest(n_jobs=6, n_estimators=500, max_samples=256, random_state=23)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_train

This returns array([1, 1, 1, ..., 1, 1, 1]), where -1 marks an outlier.
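Since the printed array is truncated, it can help to turn the labels into row positions and rank them by anomaly score. The following is a minimal sketch, assuming synthetic data in place of trbb[check_cols]; the names X_demo and clf_demo, the array shape, and the planted outliers are made up for illustration.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-in for X_train (assumption: the real data is trbb[check_cols]).
rng = np.random.RandomState(23)
X_demo = rng.normal(size=(500, 20))
X_demo[:5] += 10.0  # plant five obvious outliers in rows 0..4

clf_demo = IsolationForest(n_estimators=100, max_samples=256, random_state=23)
clf_demo.fit(X_demo)

labels = clf_demo.predict(X_demo)            # 1 = inlier, -1 = outlier
scores = clf_demo.decision_function(X_demo)  # lower score = more anomalous

# Positional row indices of the flagged points, most anomalous first.
outlier_rows = np.where(labels == -1)[0]
ranked = outlier_rows[np.argsort(scores[outlier_rows])]
print(len(outlier_rows), ranked[:10])
```

If X_train is a pandas DataFrame, these are positions rather than index labels, so X_train.iloc[ranked] would be the matching way to pull out the flagged rows.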

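The shape of y_pred_train is (5000,), which matches X_train.shape[0], so the positions of the -1 values correspond to rows of X_train. For example, one of the outlier indices returned by IF is 532, which means the point at that index (in 800-dimensional space) was detected as an outlier. Once a point has been detected, how can I approach it and use the result to dig deeper? For example, can I find the most important features that made it an outlier?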
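As far as I know, IsolationForest does not expose a per-sample feature attribution, so one rough starting point for a flagged row such as index 532 is to check which of its features sit farthest from the bulk of the data. The sketch below is only a heuristic and not something the model itself computes: it uses a robust z-score (median and MAD), and the helper name top_deviating_features and the synthetic data are hypothetical.

```python
import numpy as np

def top_deviating_features(X, row, k=5):
    """Return (feature_index, robust_z) pairs for the k features where `row` deviates most."""
    X = np.asarray(X, dtype=float)
    median = np.median(X, axis=0)
    mad = np.median(np.abs(X - median), axis=0)
    mad = np.where(mad == 0, 1.0, mad)       # avoid division by zero for constant features
    z = (X[row] - median) / (1.4826 * mad)   # 1.4826 rescales MAD to a std-dev estimate for normal data
    order = np.argsort(-np.abs(z))[:k]       # largest absolute deviation first
    return [(int(j), float(z[j])) for j in order]

# Synthetic example: row 532 is made unusual in features 3 and 17 only.
rng = np.random.RandomState(0)
X_demo = rng.normal(size=(1000, 50))
X_demo[532, [3, 17]] += 8.0

top = top_deviating_features(X_demo, 532, k=3)
print(top)
```

With the DataFrame from the question, the returned positions could be mapped back to names via check_cols[j], assuming check_cols is a list of column names. This only looks at one feature at a time, so it will miss outliers that are unusual only in a combination of features.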
