python - Duplicating training examples to handle class imbalance in a pandas DataFrame

Question

I have a pandas DataFrame containing training examples, for example:

   feature1  feature2  class
0  0.548814  0.791725      1
1  0.715189  0.528895      0
2  0.602763  0.568045      0
3  0.544883  0.925597      0
4  0.423655  0.071036      0
5  0.645894  0.087129      0
6  0.437587  0.020218      0
7  0.891773  0.832620      1
8  0.963663  0.778157      0
9  0.383442  0.870012      0

I generated it with the following code:

import pandas as pd
import numpy as np

np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class':    np.random.binomial(2, 0.1, size=number_of_samples),
    }, columns=['feature1', 'feature2', 'class'])

print(frame)

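As you can see, the training set is imbalanced (8 samples belong to class 0, while only 2 samples belong to class 1). I would like to oversample the training set. Specifically, I want to duplicate the training samples of class 1 so that the training set becomes balanced (that is, the number of class 0 samples is roughly equal to the number of class 1 samples). How can I do this?

Ideally, I would like a solution that generalizes to the multiclass setting (that is, the integers in the class column may be greater than 1).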

score 24 · Accepted Answer

You can find the size of the largest group with

max_size = frame['class'].value_counts().max()

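In your example, this equals 8. For each group, you can then sample max_size - len(group) elements with replacement. If you concatenate these samples to the original DataFrame, every group ends up with the same size and all of the original rows are kept.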

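lst = [frame]  # start with the original rows so none are lost
for class_index, group in frame.groupby('class'):
    # draw the rows each class is missing, sampling with replacement
    lst.append(group.sample(max_size - len(group), replace=True))
frame_new = pd.concat(lst)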
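
As a rough sketch, the loop above could be wrapped in a reusable function and checked with value_counts(). The function name oversample_to_max and its label_col and random_state parameters are made up for this illustration; they are not part of pandas. Because it iterates over every group, the same code should work unchanged with more than two classes.

```python
import numpy as np
import pandas as pd


def oversample_to_max(frame, label_col="class", random_state=None):
    """Hypothetical helper: pad every class up to the size of the largest class."""
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # keep all original rows
    for _, group in frame.groupby(label_col):
        n_extra = max_size - len(group)
        if n_extra > 0:
            # duplicate randomly chosen rows of this class
            parts.append(group.sample(n_extra, replace=True, random_state=random_state))
    # ignore_index avoids duplicate index labels in the result
    return pd.concat(parts, ignore_index=True)


# Rebuild the frame from the question
np.random.seed(0)
number_of_samples = 10
frame = pd.DataFrame({
    'feature1': np.random.random(number_of_samples),
    'feature2': np.random.random(number_of_samples),
    'class': np.random.binomial(2, 0.1, size=number_of_samples),
}, columns=['feature1', 'feature2', 'class'])

balanced = oversample_to_max(frame, random_state=0)
print(balanced['class'].value_counts())  # every class should show the same count
```

Note that the duplicates are exact copies of existing rows, so it is usually safer to oversample only the training split, after any train/test split has been made.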

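Using max_size - len(group) makes all group sizes exactly equal. You can play with that number, and perhaps add some noise to it, if you do not want perfectly equal groups.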
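
One possible reading of "adding noise" is to jitter the number of extra rows drawn per class, so the groups come out only roughly equal. The sketch below assumes that reading; oversample_with_jitter, max_jitter and seed are hypothetical names chosen for the example, and the toy DataFrame simply reuses the class labels from the question.

```python
import numpy as np
import pandas as pd


def oversample_with_jitter(frame, label_col="class", max_jitter=2, seed=0):
    """Hypothetical helper: oversample, perturbing each class's extra-row count."""
    rng = np.random.default_rng(seed)
    max_size = frame[label_col].value_counts().max()
    parts = [frame]  # original rows are always kept
    for _, group in frame.groupby(label_col):
        n_extra = max_size - len(group)
        if n_extra > 0:
            # shift the count by a random integer in [-max_jitter, max_jitter]
            n_extra = max(0, n_extra + int(rng.integers(-max_jitter, max_jitter + 1)))
            sample_seed = int(rng.integers(0, 2**32 - 1))
            parts.append(group.sample(n_extra, replace=True, random_state=sample_seed))
    return pd.concat(parts, ignore_index=True)


# Toy data with the same class labels as in the question (8 zeros, 2 ones)
toy = pd.DataFrame({
    'x': range(10),
    'class': [1, 0, 0, 0, 0, 0, 0, 1, 0, 0],
})

roughly_balanced = oversample_with_jitter(toy, max_jitter=2, seed=0)
print(roughly_balanced['class'].value_counts())
```

The largest class is left untouched here, and each smaller class ends up within max_jitter rows of it.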
