python - How to update a trained IsolationForest model with a new dataset/dataframe in Python?

Question

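Suppose I fit the IsolationForest() algorithm from scikit-learn on a time-series-based Dataset1 (or dataframe df1) and save the model using the methods mentioned here and here. Now I want to update the model for a new Dataset2 (or df2).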

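My findings:

This workaround about incremental learning from sklearn:

...the ability to learn incrementally from a mini-batch of instances (sometimes called "online learning") is key to out-of-core learning as it guarantees that at any given time there will be only a small amount of instances in the main memory. Choosing a good size for the mini-batch that balances relevancy and memory footprint could involve some tuning.

But sadly the IF algorithm doesn't support estimator.partial_fit(newdf).

Based on this post, the refit() offered by auto-sklearn is also not suitable for my case.

How can I update the IF model, trained and saved on Dataset1, with the new Dataset2?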

score 0 · Accepted Answer

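You can simply reuse the .fit() call available on the estimator and apply it to the new data. Note that this retrains the model from scratch on whatever data you pass in; nothing learned from the earlier data is kept.

This would be preferred, especially with time series, because the signal changes over time and you do not want older, non-representative data to be treated as potentially normal (or anomalous).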
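To make the "retrains from scratch" point concrete, here is a minimal sketch. The data, the seed, and the variable names are made up for illustration; it only checks that none of the trees from the first fit are still in the model after the second fit.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-ins for the old and new data (illustrative only).
rng = np.random.default_rng(0)
old_data = rng.integers(1, 100, size=(100, 10))
new_data = rng.integers(1, 500, size=(1000, 10))

model = IsolationForest(random_state=0)
model.fit(old_data)
old_trees = list(model.estimators_)  # keep references to the first forest

# With the default warm_start=False, fit() builds a whole new forest
# from new_data only.
model.fit(new_data)
new_trees = list(model.estimators_)

# Compare object identities: no tree from the first fit should remain.
old_ids = {id(tree) for tree in old_trees}
refit_replaced_forest = old_ids.isdisjoint(id(tree) for tree in new_trees)
print(refit_replaced_forest)
```

So after the second call the model only reflects new_data, which is usually what you want for a drifting signal.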

If the old data is important, you can simply concatenate the old training data with the new input signal data and then call .fit() again.
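If keeping the old raw data around is not practical, a possible alternative that goes beyond the suggestion above is the warm_start parameter of IsolationForest: with warm_start=True, a later fit() keeps the existing trees and only grows the additional ones requested through n_estimators, using the new data. This is a rough sketch under assumptions, not true incremental learning: the old trees are never updated, and the values 100, 150 and max_samples=64 are arbitrary choices for this example.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

# Synthetic stand-ins for df1 and df2 (illustrative only).
rng = np.random.default_rng(0)
df1 = rng.integers(1, 100, size=(100, 10))
df2 = rng.integers(1, 500, size=(1000, 10))

# max_samples is fixed (assumption) so that trees grown in both rounds
# use the same subsample size regardless of the dataset length.
model = IsolationForest(
    n_estimators=100, max_samples=64, warm_start=True, random_state=0
)
model.fit(df1)
n_trees_before = len(model.estimators_)

# Ask for 50 more trees; with warm_start=True they are fitted on df2
# and appended to the 100 trees already fitted on df1.
model.set_params(n_estimators=150)
model.fit(df2)
n_trees_after = len(model.estimators_)

# Scores now come from the mixed ensemble of old and new trees.
scores = model.score_samples(df2)
print(n_trees_before, n_trees_after)
```

The share of old versus new trees controls how much weight the old data keeps, so this is a trade-off you would have to tune for your own signal.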

Also note that, according to the sklearn documentation, it is better to use joblib than pickle for saving the model.

An MRE with resources below:

# Model
from sklearn.ensemble import IsolationForest

# Saving file
import joblib

# Data
import numpy as np

# Create a new model
model = IsolationForest()

# Generate some old data
df1 = np.random.randint(1,100,(100,10))
# Train the model
model.fit(df1)

# Save it off
joblib.dump(model, 'isf_model.joblib')

# Load the model
model = joblib.load('isf_model.joblib')

# Generate new data
df2 = np.random.randint(1,500,(1000,10))

# If the original data is now not important, I can just call .fit() again.
# If you are using time-series based data, this is preferred, as older data may not be representative of the current state
model.fit(df2)

# If the original data is important, I can simply join the old data to new data. There are multiple options for this:
# Pandas: https://pandas.pydata.org/pandas-docs/stable/user_guide/merging.html
# Numpy: https://numpy.org/doc/stable/reference/generated/numpy.concatenate.html

combined_data = np.concatenate((df1, df2))
model.fit(combined_data)
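For the Pandas option mentioned in the comments above, a minimal sketch could look like the following. The column names are hypothetical and the data is synthetic; the only assumption about your real data is that df1 and df2 share the same columns.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
columns = [f"feature_{i}" for i in range(10)]  # hypothetical column names

# Synthetic stand-ins for the old and new dataframes.
df1 = pd.DataFrame(rng.integers(1, 100, size=(100, 10)), columns=columns)
df2 = pd.DataFrame(rng.integers(1, 500, size=(1000, 10)), columns=columns)

# Stack the rows of both frames; ignore_index avoids duplicate row labels.
combined = pd.concat([df1, df2], ignore_index=True)

model = IsolationForest(random_state=0)
model.fit(combined)

# predict() returns 1 for inliers and -1 for outliers.
labels = model.predict(df2)
print(combined.shape)
```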
