python - Isolation Forest for anomaly detection

Question

In this IsolationForest anomaly detection example:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Generate train data
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]
# Generate some regular novel observations
X = 0.3 * rng.randn(20, 2)
X_test = np.r_[X + 2, X - 2]
# Generate some abnormal novel observations
X_outliers = rng.uniform(low=-4, high=4, size=(20, 2))

# fit the model
clf = IsolationForest(max_samples=100, random_state=rng)
clf.fit(X_train)
y_pred_train = clf.predict(X_train)
y_pred_test = clf.predict(X_test)
y_pred_outliers = clf.predict(X_outliers)

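I believe the outliers in this code are introduced randomly. But if I use real data for anomaly detection:

How should I go about it?
If I already have a dataset, how do I identify the anomalies? I am trying to use the Combined Cycle Power Plant dataset. Also, if you know of any other good datasets for practicing anomaly detection, please share some links!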

score 0 · Accepted Answer

If you are asking about contamination in the dataset, then you need to look at the contamination parameter.

clf = IsolationForest(max_samples=100, random_state=rng, contamination='auto')

It is based on the assumption that the data contains a certain proportion of contamination, i.e. outliers.
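As a rough sketch of what the contamination parameter does, the snippet below fits the same training data from the question twice: once with contamination='auto' and once with an explicit fraction. The value 0.05 and the fixed random_state=0 are assumptions chosen only for illustration; on real data the expected outlier fraction has to come from domain knowledge.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.RandomState(42)

# Same two-cluster training data as in the question
X = 0.3 * rng.randn(100, 2)
X_train = np.r_[X + 2, X - 2]

# 'auto' uses the fixed threshold from the original Isolation Forest paper
clf_auto = IsolationForest(max_samples=100, random_state=0,
                           contamination='auto').fit(X_train)

# A float in (0, 0.5] is the assumed fraction of outliers; the decision
# threshold is set so that roughly this share of training points is flagged
clf_5pct = IsolationForest(max_samples=100, random_state=0,
                           contamination=0.05).fit(X_train)

pred_auto = clf_auto.predict(X_train)  # +1 = inlier, -1 = outlier
pred_5pct = clf_5pct.predict(X_train)

frac_auto = np.mean(pred_auto == -1)
frac_5pct = np.mean(pred_5pct == -1)
print("flagged with 'auto':", frac_auto)
print("flagged with 0.05:  ", frac_5pct)
```

Note that the model flags about the requested fraction of the training set even when the data contains no real anomalies, so the predictions are only as meaningful as the contamination estimate.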

score 0 · Accepted Answer

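rng is a random number generator, so the data in the example can be treated as a synthetic dataset. For a real-world dataset, you have to load it with the load functions of numpy or pandas.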
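A minimal sketch of that workflow with pandas is shown below. To keep it runnable without a download, it reads a CSV from an in-memory buffer filled with synthetic rows; the column names f1, f2, f3 and the planted anomalies are hypothetical stand-ins, not values from any real dataset. With a real file you would pass its path to pd.read_csv instead.

```python
import io
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

# Stand-in for a real CSV file: synthetic rows, NOT actual measurements
rng = np.random.RandomState(0)
fake = pd.DataFrame(rng.normal(size=(200, 3)), columns=["f1", "f2", "f3"])
fake.iloc[:5] = fake.iloc[:5] + 8  # plant a few obvious anomalies in rows 0-4
csv_buffer = io.StringIO(fake.to_csv(index=False))

# With a real dataset, replace csv_buffer with the path to your file
df = pd.read_csv(csv_buffer)

# Use only the numeric columns as features
X = df.select_dtypes(include="number").to_numpy()

clf = IsolationForest(random_state=0).fit(X)
df["anomaly"] = clf.predict(X)      # -1 = flagged as anomaly, +1 = normal
df["score"] = clf.score_samples(X)  # lower score = more abnormal

# Inspect the most suspicious rows first
suspects = df.sort_values("score").head(10)
print(suspects)
```

Since real data usually has no labels, ranking by score and reviewing the lowest-scoring rows by hand is one practical way to judge whether the flagged points are genuine anomalies.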

You can find some datasets for anomaly detection tasks here: http://odds.cs.stonybrook.edu

From what I have seen so far, Prophet is a popular framework for time series analysis tasks, including anomaly detection on streaming data: https://www.kaggle.com/vinayjaju/anomaly-detection-using-facebook-s-prophet
