python - Python oversampling: combining several samplers in a pipeline

Question

My question concerns a ValueError raised by the SMOTE class:

Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6

# imbalanced-learn is a package containing an implementation of SMOTE
from imblearn.over_sampling import SMOTE, ADASYN, RandomOverSampler
from imblearn.pipeline import Pipeline
# label vector: the first column of the frame
y = feature_set.iloc[:, 0]
# feature matrix: everything except the text and label columns
x = feature_set.loc[:, feature_set.columns != 'text_column']
x = x.loc[:, x.columns != 'label_column']
x_resampled, y_resampled = SMOTE().fit_resample(x, y)

After some investigation, I found that some of my classes (158 classes in total) have extremely few samples.
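A quick way to confirm which classes are affected is to count the samples per class and compare against SMOTE's `k_neighbors` (5 by default; the error reports 6 because the neighbour search also counts the sample itself). This is only a sketch: the helper name `classes_too_small_for_smote` is made up for illustration and is not part of imbalanced-learn.

```python
from collections import Counter


def classes_too_small_for_smote(y, k_neighbors=5):
    """Return {label: count} for classes with too few samples for SMOTE.

    Hypothetical helper: SMOTE needs more than k_neighbors samples in a
    class, so any class with count <= k_neighbors is reported.
    """
    counts = Counter(y)
    return {label: n for label, n in counts.items() if n <= k_neighbors}
```

Calling it on the label vector `y` from above lists the classes that would trigger the error.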

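Following the solution proposed in this post:

Create a pipeline that uses SMOTE and RandomOverSampler, so that the condition n_neighbors <= n_samples is satisfied for the classes handled by SMOTE, and random oversampling is used when the condition is not met.

However, I am still struggling to set up and run my experiment.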

# initialize the oversamplers
smote = SMOTE()
randomSampler = RandomOverSampler()
# create a pipeline that applies SMOTE first, then random oversampling
pipeline = Pipeline([('smote', smote), ('randomSampler', randomSampler)])
pipeline.fit_resample(x, y)

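When I run this, I still get the same error. My guess is that the resulting pipeline applies both samplers, whereas I need only one of them applied at a time, based on a predefined condition (RandomOverSampler if the number of items is below X, SMOTE otherwise). Is there a way to set a condition so that RandomOverSampler is called when the number of items is extremely small?

Thanks in advance.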

score 2 · Accepted Answer

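I ran into the same problem as you (Expected n_neighbors <= n_samples, but n_samples = 1, n_neighbors = 6), and like you I read and followed that person's suggestion.

I think you are getting the same error because you put the random oversampler after the SMOTE step. A pipeline runs its steps in order, so SMOTE still sees the classes that have only one sample. In other words, you need to oversample the minority classes before applying the SMOTE algorithm.

This worked for me: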

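from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from imblearn.over_sampling import RandomOverSampler, SMOTE
from imblearn.pipeline import Pipeline

pipe = Pipeline([
    ('tfidf', TfidfVectorizer()),
    ('ros', RandomOverSampler()),   # random oversampling runs first
    ('oversampler', SMOTE()),       # SMOTE then sees enough samples per class
    ('clf', LinearSVC()),
])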
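One caveat, as far as I can tell: with its default settings RandomOverSampler already brings every class up to the size of the majority class, so SMOTE has little left to do afterwards. A possible variant, closer to the "condition" asked for in the question, is to let the random sampler lift only the rare classes to `k_neighbors + 1` samples and leave the rest to SMOTE. The sketch below assumes that `sampling_strategy` accepts a callable that takes `y` and returns a `{class: target_count}` dict; the names `lift_rare_classes` and `make_resampler` are made up for illustration.

```python
from collections import Counter


def lift_rare_classes(y, min_count=6):
    """Return {label: min_count} for every class with fewer than min_count samples.

    Hypothetical helper: classes that already have enough samples are left
    out of the dict, so the random sampler does not touch them.
    """
    counts = Counter(y)
    return {label: min_count for label, n in counts.items() if n < min_count}


def make_resampler(k_neighbors=5):
    """Build a sampler-only pipeline: random oversampling first, then SMOTE."""
    # Imported here so the helper above stays usable without imbalanced-learn.
    from functools import partial
    from imblearn.over_sampling import SMOTE, RandomOverSampler
    from imblearn.pipeline import Pipeline

    # SMOTE with k_neighbors needs at least k_neighbors + 1 samples per class.
    strategy = partial(lift_rare_classes, min_count=k_neighbors + 1)
    return Pipeline([
        ("ros", RandomOverSampler(sampling_strategy=strategy)),
        ("smote", SMOTE(k_neighbors=k_neighbors)),
    ])
```

With the question's data this would be used as `x_resampled, y_resampled = make_resampler().fit_resample(x, y)`. Lowering `k_neighbors` is another option when only a few classes are slightly too small, but it does not help a class with a single sample.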
