pandas - Undersampling a multi-label imbalanced dataset in pandas

Asked 2017-05-31T19:51:59.977

Viewed 1012 times

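I'm working on a home-rolled undersampling function, because imblearn doesn't work well with multi-label classification (for example, it only accepts a one-dimensional y).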

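I want to iterate over X and y and drop one majority-class row out of every 2 or 3. The goal is a quick and dirty way to reduce the number of rows in the majority class.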

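def undersample(X, y):
    # Counts majority-class rows (label == 0) and drops one each time the counter exceeds 3.
    counter = 0
    # iterrows() yields (index, row) pairs; itertuples() yields namedtuples that cannot be unpacked like this.
    for index, row in y.iterrows():
        if row['rectangle_here'] == 0:
            counter += 1
            if counter > 3:
                counter = 0
                X.drop(index, inplace=True)
                y.drop(index, inplace=True)
    return X, y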

But even on a small number of rows (about 30,000), it crashes my kernel.

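y looks something like this, where f1 is present whenever f2 or f3 is present.

So let's count the number of times 0 occurs in f1, and drop every third row where it is 0: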

                  f1      f2       f3
0                  0       0       0
1                  0       0       0
2                  0       0       0
3                  1       0       1
4                  0       0       0
5                  0       0       0
6                  0       0       0
7                  0       0       0
8                  0       0       0
9                  0       0       0


0 Answers
