64

I am trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it myself? Any pointers to code that does this?

These subsamples should be random and can overlap, as I feed each one to a separate classifier in a very large ensemble of classifiers.

In Weka there is a tool called spreadsubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample

(I know about weighting, but that's not what I'm looking for.)


15 Answers

32

There is now a full-fledged python package to address imbalanced data. It is available as a sklearn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn
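
For example, drawing one balanced subsample by random undersampling takes only a few lines. A minimal sketch, assuming a recent version of the package where fit_resample is the resampling entry point (X and y below are hypothetical):

import numpy as np
from imblearn.under_sampling import RandomUnderSampler

X = np.random.rand(1000, 5)                            # hypothetical features
y = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])  # imbalanced labels

# undersample the majority class down to the minority class size
X_bal, y_bal = RandomUnderSampler(random_state=0).fit_resample(X, y)
print(np.bincount(y_bal))  # both classes now have the same count

Repeating this with different random_state values yields the N overlapping balanced subsamples the question asks for.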

Answered 2017-11-17T18:06:47.710
30

Here's my first version, which seems to work fine; feel free to copy it or make suggestions on how it could be more efficient (I have fairly long experience with programming in general, but not that long with python or numpy).

This function creates a single random balanced subsample.

Edit: the subsample size currently samples down to the minority class; this should probably be changed.

import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):
    # collect the samples of each class and find the smallest class size
    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    # every class is sampled down to (a fraction of) the minority class size
    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys

For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes (a combined sketch follows the list):

  1. Replace the np.random.shuffle line with

    this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

  2. Replace the np.concatenate lines with

    xs = pd.concat(xs)
    ys = pd.Series(data=np.concatenate(ys), name='target')
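
Putting both changes together, a DataFrame-flavored variant of the whole function might look like this (a sketch, assuming x is a DataFrame with a unique index and y is a label Series aligned with it):

import numpy as np
import pandas as pd

def balanced_subsample_df(x, y, subsample_size=1.0):
    # same logic as the numpy version above, adapted to pandas objects
    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            # shuffle rows by permuting the index instead of np.random.shuffle
            this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

        xs.append(this_xs[:use_elems])
        ys.append(np.full(use_elems, ci))

    xs = pd.concat(xs)
    ys = pd.Series(data=np.concatenate(ys), name='target')
    return xs, ys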

Answered 2014-05-05T19:11:46.463
9

A version for pandas Series:

import numpy as np

def balanced_subsample(y, size=None):
    # returns a list of index labels forming a balanced subsample of y
    subsample = []

    if size is None:
        n_smp = y.value_counts().min()
    else:
        n_smp = int(size / len(y.value_counts().index))

    for label in y.value_counts().index:
        samples = y[y == label].index.values
        index_range = range(samples.shape[0])
        # draw n_smp positions per class, without replacement
        indexes = np.random.choice(index_range, size=n_smp, replace=False)
        subsample += samples[indexes].tolist()

    return subsample
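
Since the function returns index labels rather than rows, it can be used to balance a DataFrame via .loc. A small usage sketch (df and its label column are hypothetical):

import pandas as pd

df = pd.DataFrame({'feature': range(10),
                   'label': [0] * 7 + [1] * 3})  # imbalanced: 7 vs. 3

idx = balanced_subsample(df['label'])
balanced_df = df.loc[idx]
print(balanced_df['label'].value_counts())  # 3 rows of each class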
Answered 2016-07-18T15:14:42.333
8

I found the best solutions here.

And this is the one I think is simplest.

import pandas as pd
from imblearn.under_sampling import RandomUnderSampler

dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values
y = dataset.iloc[:, 12].values

rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)

Then you can use the X_rus, y_rus data.

For versions 0.4 and above:

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_sample(X, y)

Then, the indices of the randomly selected samples can be reached via the sample_indices_ attribute.

Answered 2019-04-12T15:27:11.143
5

This type of data splitting is not provided among the built-in data splitting techniques exposed in sklearn.cross_validation.

What seems similar to your needs is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while preserving the structure of the whole dataset, i.e. meticulously enforcing the same imbalance that is in your main dataset. While this is not what you are looking for, you may be able to use the code in it and change the imposed ratio to always be 50/50.

(This would probably be a good contribution to scikit-learn if you feel up to it.)
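
For illustration, a minimal sketch of drawing N stratified (imbalance-preserving) subsamples this way, using the modern sklearn.model_selection import path rather than the old sklearn.cross_validation module (X and y are hypothetical):

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.random.rand(1000, 5)
y = np.random.choice([0, 1], size=1000, p=[0.9, 0.1])

# 10 overlapping subsamples, each 20% of the data, each mirroring
# the class ratio of the full dataset
sss = StratifiedShuffleSplit(n_splits=10, train_size=0.2)
for subsample_idx, _ in sss.split(X, y):
    X_sub, y_sub = X[subsample_idx], y[subsample_idx]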

Answered 2014-05-05T12:38:38.160
3

Here is a version of the above code that works for multiclass groups (in my tested case: groups 0, 1, 2, 3, 4):

import numpy as np

def balanced_sample_maker(X, y, sample_size, random_seed=None):
    """ return a balanced data set by sampling all classes with sample_size
        current version is developed on assumption that the positive
        class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        np.random.seed(random_seed)

    # find observation index of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)

This also returns the indices, so they can be used for other datasets and to keep track of how frequently each dataset was used (helpful for training).
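
For instance (hypothetical data; sample_size is the number of observations drawn per class, with replacement):

import numpy as np

X = np.random.rand(1000, 4)
y = np.random.choice(5, size=1000, p=[0.5, 0.2, 0.15, 0.1, 0.05])

X_bal, y_bal, idx = balanced_sample_maker(X, y, sample_size=200, random_seed=42)
print(np.bincount(y_bal))  # [200 200 200 200 200]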

Answered 2016-07-25T08:33:06.267
3

Below is my python implementation for creating a balanced copy of the data. Assumptions: 1. the target variable (y) is a binary class (0 vs. 1); 2. 1 is the minority.

from numpy import unique
from numpy import random 

def balanced_sample_maker(X, y, random_seed=None):
    """ return a balanced data set by oversampling minority class 
        current version is developed on assumption that the positive
        class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        random.seed(random_seed)

    # find observation index of each class levels
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversampling on observations of positive label
    sample_size = uniq_counts[0]
    over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()
    balanced_copy_idx = groupby_levels[0] + over_sample_idx
    random.shuffle(balanced_copy_idx)

    return X[balanced_copy_idx, :], y[balanced_copy_idx]
Answered 2016-04-25T15:21:03.227
1

A slight modification to the top answer by mikkom.

In case you want to preserve the ordering of the larger class's data, i.e. you don't want to shuffle it:

Instead of

    if len(this_xs) > use_elems:
        np.random.shuffle(this_xs)

do this

    if len(this_xs) > use_elems:
        ratio = len(this_xs) // use_elems  # integer division so the slice stride is an int
        this_xs = this_xs[::ratio]
Answered 2017-08-01T19:01:40.033
1

Here are my 2 cents. Assume we have the following imbalanced dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Category': np.random.choice(['A','B','C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
                   'Sentiment': np.random.choice([0,1], size=1000, replace=True, p=[0.35, 0.65]),
                   'Gender': np.random.choice(['M','F'], size=1000, replace=True, p=[0.70, 0.30])})
print(df.head())

The first rows:

  Category  Sentiment Gender
0        C          1      M
1        B          0      M
2        B          0      M
3        B          0      M
4        A          0      M

Now suppose we want to get a balanced dataset by Sentiment:

df_grouped_by = df.groupby(['Sentiment'])

# sample each group down to the size of the smallest group
df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

df_balanced = df_balanced.droplevel(['Sentiment'])
print(df_balanced.head())

The first rows of the balanced dataset:

  Category  Sentiment Gender
0        C          0      F
1        C          0      M
2        C          0      F
3        C          0      M
4        C          0      M

Let's verify that it is balanced in terms of Sentiment:

df_balanced.groupby(['Sentiment']).size()

We get:

Sentiment
0    369
1    369
dtype: int64

As we can see, we ended up with 369 positive and 369 negative Sentiment labels.

Answered 2021-07-27T15:41:24.957
1

A short, pythonic solution to balance a pandas DataFrame either by subsampling (uspl=True) or oversampling (uspl=False).

For uspl=True, this code will take a random sample without replacement of size equal to the smallest stratum from all strata. For uspl=False, this code will take a random sample with replacement of size equal to the largest stratum from all strata.

def balanced_spl_by(df, lblcol, uspl=True):
    datas_l = [df[df[lblcol] == l].copy() for l in list(set(df[lblcol].values))]
    lsz = [f.shape[0] for f in datas_l]
    return pd.concat([f.sample(n=(min(lsz) if uspl else max(lsz)),
                               replace=(not uspl)).copy()
                      for f in datas_l], axis=0).sample(frac=1)

This will only work with a Pandas DataFrame, but that seems to be a common application, and restricting it to Pandas DataFrames significantly shortens the code as far as I can tell.
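
For example, with a DataFrame df that has a Sentiment label column (like the one built a couple of answers above):

df_under = balanced_spl_by(df, 'Sentiment', uspl=True)   # undersample to the smallest class
df_over = balanced_spl_by(df, 'Sentiment', uspl=False)   # oversample to the largest class
print(df_under['Sentiment'].value_counts())              # equal counts at the smallest class size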

Answered 2017-07-14T12:25:07.863
1

Just use the following code to select 100 rows in each class, with duplicates. activity is the column holding my classes (the labels of the dataset).

balanced_df = Pdf_train.groupby('activity', as_index=False, group_keys=False).apply(lambda s: s.sample(100, replace=True))
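
A quick sanity check that every class really ends up with 100 rows:

print(balanced_df['activity'].value_counts())  # each activity label should appear exactly 100 times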
Answered 2019-01-31T20:43:09.553
0

My subsampler version, hope this helps:

import random

def subsample_indices(y, size):
    indices = {}
    target_values = set(y)
    for t in target_values:
        indices[t] = [i for i in range(len(y)) if y[i] == t]
    # cap each class at the requested size (or at the smallest class, if smaller)
    min_len = min(size, min([len(indices[t]) for t in indices]))
    for t in indices:
        if len(indices[t]) > min_len:
            indices[t] = random.sample(indices[t], min_len)
    return indices

x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
j = subsample_indices(x, 2)
print(j)
print([x[t] for t in j[-1]])
print([x[t] for t in j[1]])
Answered 2015-01-14T02:16:32.100
0

Although it's already answered, I stumbled upon your question looking for something similar. After some more research, I believe sklearn.model_selection.StratifiedKFold can be used for this purpose:

from sklearn.model_selection import StratifiedKFold

X = samples_array
y = classes_array # subsamples will be stratified according to y
n = desired_number_of_subsamples

skf = StratifiedKFold(n, shuffle=True)

for _, batch in skf.split(X, y):
    do_something(X[batch], y[batch])

It is important to add the _ because, since skf.split() is used to create stratified folds for K-fold cross-validation, it returns two lists of indices: train (n - 1 / n of the elements) and test (1 / n of the elements).

Please note that this is as of sklearn 0.18. In sklearn 0.17, the same functionality can be found in the cross_validation module instead.

Answered 2016-11-25T05:33:28.540
0

Here is my solution, which can be tightly integrated into an existing sklearn pipeline:

from sklearn.model_selection import RepeatedKFold
import numpy as np


class DownsampledRepeatedKFold(RepeatedKFold):

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        self.n_splits = n_splits
        super(DownsampledRepeatedKFold, self).__init__(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state)

    def split(self, X, y=None, groups=None):
        for i in range(self.n_repeats):
            np.random.seed()
            # get index of major class (negative)
            idxs_class0 = np.argwhere(y == 0).ravel()
            # get index of minor class (positive)
            idxs_class1 = np.argwhere(y == 1).ravel()
            # get length of minor class
            len_minor = len(idxs_class1)
            # subsample of major class of size minor class
            idxs_class0_downsampled = np.random.choice(idxs_class0, size=len_minor)
            original_indx_downsampled = np.hstack((idxs_class0_downsampled, idxs_class1))
            np.random.shuffle(original_indx_downsampled)
            splits = list(self.cv(n_splits=self.n_splits, shuffle=True).split(original_indx_downsampled))

            for train_index, test_index in splits:
                yield original_indx_downsampled[train_index], original_indx_downsampled[test_index]

Use it as usual:

for train_index, test_index in DownsampledRepeatedKFold(n_splits=5, n_repeats=10).split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
Answered 2019-03-16T13:23:36.683
0

Here is a solution that is:

  • simple (< 10 lines of code)
  • fast (besides one for loop, pure NumPy)
  • free of external dependencies other than NumPy
  • very cheap to generate new balanced random samples (just call np.random.choice() again with the precomputed weights); useful for generating different shuffled and balanced samples between training epochs

import numpy as np

def stratified_random_sample_weights(labels):
    # labels: one-hot (or multi-hot) array of shape (num_samples, n_classes)
    num_samples, n_classes = labels.shape
    sample_weights = np.zeros(num_samples)
    for class_i in range(n_classes):
        class_indices = np.where(labels[:, class_i] == 1)  # find indices where class_i is 1
        class_indices = np.squeeze(class_indices)  # get rid of extra dim
        num_samples_class_i = len(class_indices)
        assert num_samples_class_i > 0, f"No samples found for class index {class_i}"

        sample_weights[class_indices] = 1.0 / num_samples_class_i  # note: samples with no classes present will get weight=0

    return sample_weights / sample_weights.sum()  # sum(weights) == 1

Then you use these weights over and over to generate balanced indices with np.random.choice():

sample_weights = stratified_random_sample_weights(labels)
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

Full example:

# generate data
import numpy as np
from sklearn.preprocessing import OneHotEncoder

num_samples = 10000
n_classes = 10
ground_truth_class_weights = np.logspace(1,3,num=n_classes,base=10,dtype=float)  # exponentially growing
ground_truth_class_weights /= ground_truth_class_weights.sum()  # sum to 1
labels = np.random.choice(list(range(n_classes)), size=num_samples, p=ground_truth_class_weights)
labels = labels.reshape(-1, 1)  # turn each element into a list
labels = OneHotEncoder(sparse=False).fit_transform(labels)


print(f"original counts: {labels.sum(0)}")
# [  38.   76.  127.  191.  282.  556.  865. 1475. 2357. 4033.]

sample_weights = stratified_random_sample_weights(labels)
sample_size = 1000
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

print(f"rebalanced counts: {labels[chosen_indices].sum(0)}")
# [104. 107.  88. 107.  94. 118.  92.  99. 100.  91.]
Answered 2020-12-10T00:58:01.727