class - Trying to balance my dataset through sample_weight in scikit-learn

Question

I am using RandomForest for classification, and my dataset is imbalanced: 5830 "no" versus 1006 "yes". I tried to balance it with class_weight and sample_weight, but I could not get it to work.

My code is:

X_train,X_test,y_train,y_test = train_test_split(arrX,y,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw) 
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})

But when using class_weight and sample_weight, my rates (TPR, FPR, ROC) do not improve at all.

Why? Am I doing something wrong?

However, if I use a function called balanced_subsample, my rates improve a lot:

def balanced_subsample(x,y,subsample_size):

    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems*subsample_size)

    xs = []
    ys = []

    for ci,this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs,ys

My new code is:

X_train_subsampled,y_train_subsampled=balanced_subsample(arrX,y,0.5)
X_train,X_test,y_train,y_test = train_test_split(X_train_subsampled,y_train_subsampled,test_size=0.25)
cw='auto'
clf=RandomForestClassifier(class_weight=cw) 
param_grid = { 'n_estimators': [10,50,100,200,300],'max_features': ['auto', 'sqrt', 'log2']}
sw = np.array([1 if i == 0 else 8 for i in y_train])
CV_clf = GridSearchCV(estimator=clf, param_grid=param_grid, cv= 10,fit_params={'sample_weight': sw})

Thanks

score 2 · Accepted Answer

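This is not a complete answer yet, but hopefully it will help get there.

First, some general remarks:

To debug this kind of problem, deterministic behavior is often useful. You can pass the random_state parameter to RandomForestClassifier and to the other scikit-learn objects that have inherent randomness, so that you get the same results on every run. You will also need: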
```
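import random
import numpy as np

# Pass a fixed integer: seed() without an argument re-seeds from OS entropy,
# which would not make runs repeatable.
np.random.seed(0)
random.seed(0)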
```

so that your balanced_subsample function behaves the same way on every run.
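As a minimal sketch of the random_state advice, the snippet below pins the split and the forest to one seed and checks that two runs agree. The synthetic data from make_classification, the seed value and the helper name fit_and_predict are assumptions made for illustration; they do not come from the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

SEED = 0  # arbitrary fixed value, chosen only for illustration

# Synthetic stand-in for the real data: roughly 85% class 0, 15% class 1.
X, y = make_classification(n_samples=600, n_features=10,
                           weights=[0.85, 0.15], random_state=SEED)

def fit_and_predict(seed):
    """Split, fit and predict with every source of randomness pinned to `seed`."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.25, stratify=y, random_state=seed)
    clf = RandomForestClassifier(n_estimators=50, random_state=seed)
    clf.fit(X_train, y_train)
    return clf.predict_proba(X_test)

first = fit_and_predict(SEED)
second = fit_and_predict(SEED)

# With the same seed everywhere, both runs should give identical probabilities.
print(np.array_equal(first, second))
```

Once runs are repeatable, any change in TPR or FPR can be attributed to the weighting or subsampling change rather than to random noise.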

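Don't grid-search n_estimators: in a random forest, adding trees does not cause overfitting, so more trees are generally better and the only cost is computation time.
Note that sample_weight and class_weight serve a similar purpose: the actual weight of a sample will be its sample_weight multiplied by the weight inferred from class_weight.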
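To make the multiplication concrete, here is a small arithmetic sketch using the class counts from the question. It assumes the class_weight="balanced" formula that scikit-learn documents, n_samples / (n_classes * np.bincount(y)); "balanced" is the mode that superseded the older "auto" setting, whose exact weights may differ.

```python
import numpy as np

# Class counts taken from the question: 5830 "no" (label 0), 1006 "yes" (label 1).
y = np.array([0] * 5830 + [1] * 1006)

# Formula scikit-learn documents for class_weight="balanced".
counts = np.bincount(y)
class_weights = len(y) / (len(counts) * counts)

# Manual per-sample weights from the question: 1 for class 0, 8 for class 1.
sample_weight = np.where(y == 0, 1.0, 8.0)

# Effective weight of each sample when both settings are supplied together.
effective = sample_weight * class_weights[y]

# Ratio between the weight of one minority sample and one majority sample.
ratio = effective[y == 1][0] / effective[y == 0][0]

print(class_weights.round(3))
print(round(ratio, 2))
```

Under these assumptions a single "yes" sample counts about 46 times as much as a single "no" sample, so the two settings stack rather than act as alternatives.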

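Could you try:

Using subsample_size=1 in your balanced_subsample function. Unless there is a particular reason not to, it is better to compare results on a similar number of samples.
Using your subsampling strategy with class_weight and sample_weight both set to None.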
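A rough sketch of that experiment could look like the following. The synthetic data, the undersample helper and the choice to split before subsampling (so the test set keeps the original imbalance) are assumptions of this sketch, not something taken from the question's code.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

# Synthetic stand-in for arrX / y from the question (about 15% positives).
X, y = make_classification(n_samples=2000, n_features=10,
                           weights=[0.85, 0.15], random_state=0)

# Split first, so the test set keeps the original class imbalance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

def undersample(X, y, rng):
    """Hypothetical helper: keep as many rows of each class as the rarest class has."""
    labels, counts = np.unique(y, return_counts=True)
    n_min = counts.min()
    keep = np.concatenate([
        rng.choice(np.flatnonzero(y == label), size=n_min, replace=False)
        for label in labels
    ])
    return X[keep], y[keep]

rng = np.random.default_rng(0)
X_bal, y_bal = undersample(X_train, y_train, rng)

# No class_weight and no sample_weight: subsampling is the only balancing step.
clf = RandomForestClassifier(n_estimators=100, class_weight=None, random_state=0)
clf.fit(X_bal, y_bal)

tn, fp, fn, tp = confusion_matrix(y_test, clf.predict(X_test)).ravel()
tpr = tp / (tp + fn)
fpr = fp / (fp + tn)
print("TPR:", tpr, "FPR:", fpr)
```

Running the same script with weights instead of subsampling then gives numbers that can be compared on the same untouched test set.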

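Edit: Reading your comment again, I realize that your results are not so surprising!
You get a better (higher) TPR but a worse (higher) FPR.
This just means that your classifier tries harder to get the samples from class 1 right, and therefore produces more false positives (while of course also getting more of them right!).
If you keep increasing the class/sample weights in the same direction, you will see this trend continue.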
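The trade-off can be illustrated without any model. As a loose analogy (an assumption of this sketch, not a claim about how the forest works internally), favoring class 1 more strongly behaves like lowering the score threshold at which a sample is called positive. The labels and scores below are made up.

```python
def tpr_fpr(y_true, y_pred):
    """Return (true positive rate, false positive rate) for binary 0/1 labels."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    return tp / (tp + fn), fp / (fp + tn)

# Made-up labels and class-1 scores, purely for illustration.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.35, 0.55, 0.8, 0.9]

results = []
for threshold in (0.75, 0.5, 0.25):
    # A lower threshold means the classifier is more eager to predict class 1.
    y_pred = [1 if s >= threshold else 0 for s in scores]
    results.append((threshold, tpr_fpr(y_true, y_pred)))

for threshold, (tpr, fpr) in results:
    print(threshold, round(tpr, 2), round(fpr, 2))
```

As the threshold drops, TPR and FPR rise together, which is the same pattern described above for heavier class 1 weights.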

score 1 · Accepted Answer

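There is an imbalanced-learn API that helps with oversampling and undersampling data, which may be useful in this situation. You can pass your training set to one of its methods and it will output the oversampled data for you. See the simple example below: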

from imblearn.over_sampling import RandomOverSampler

ros = RandomOverSampler(random_state=1)

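# fit_resample replaces the older fit_sample, which recent imbalanced-learn releases no longer provide
x_oversampled, y_oversampled = ros.fit_resample(orig_x_data, orig_y_data)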
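For intuition about what random oversampling does, here is a NumPy-only sketch. It is not imbalanced-learn's implementation; the function name random_oversample and the toy data are hypothetical.

```python
import numpy as np

def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until every class
    has as many rows as the largest one."""
    rng = np.random.default_rng(seed)
    labels, counts = np.unique(y, return_counts=True)
    n_max = counts.max()
    idx_parts = []
    for label, count in zip(labels, counts):
        idx = np.flatnonzero(y == label)
        # Keep every original row, then draw the missing ones with replacement.
        extra = rng.choice(idx, size=n_max - count, replace=True)
        idx_parts.append(np.concatenate([idx, extra]))
    keep = np.concatenate(idx_parts)
    return X[keep], y[keep]

# Toy data: 6 majority rows and 2 minority rows (made up for illustration).
X = np.arange(16).reshape(8, 2)
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])

X_over, y_over = random_oversample(X, y)
print(np.bincount(y_over))
```

Because the extra rows are exact copies, oversampling is usually applied to the training split only, so that duplicated rows cannot leak into the evaluation data.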

Here is the link to the API: http://contrib.scikit-learn.org/imbalanced-learn/api.html

Hope this helps!
