python-3.x - Using sklearn.train_test_split for imbalanced data

Question

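I have a highly imbalanced dataset. I use the sklearn.train_test_split function to extract the training set. Now I want to oversample the training set, so I counted the number of type1 samples (my dataset has 2 classes, type1 and type2), but almost all of my training data is type1, so I can't oversample.

Previously I split the data into train and test sets with my own code. In that code, 0.8 of all type1 data and 0.8 of all type2 data went into the training set.

How can I use this approach with train_test_split or another splitting method in sklearn?

*I should use only sklearn or my own code.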

score 9 · Accepted Answer

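You're looking for stratification: a stratified split keeps the class proportions the same in the training and test sets, which is what your own 0.8-per-class code did.

train_test_split has a stratify parameter to which you can pass the array of labels, e.g.: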

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y,
                                                    stratify=y, 
                                                    test_size=0.2)

There's also StratifiedShuffleSplit.
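
To convince yourself that stratify=y really keeps the class ratio, here is a minimal sketch that compares the class fractions before and after the split. The synthetic data from make_classification, the 95/5 class weights, and the helper class_fractions are assumptions made for illustration; they are not part of the question.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced data: roughly 95% of the samples fall in class 0.
X, y = make_classification(n_samples=1000, n_classes=2,
                           weights=[0.95, 0.05], random_state=0)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, test_size=0.2, random_state=0)


def class_fractions(labels):
    """Return the fraction of samples that belongs to each class."""
    counts = np.bincount(labels)
    return counts / counts.sum()


full_frac = class_fractions(y)
train_frac = class_fractions(y_train)
test_frac = class_fractions(y_test)

# The three rows should be nearly identical when the split is stratified.
print("full :", full_frac)
print("train:", train_frac)
print("test :", test_frac)
```

Without stratify=y the minority fraction in the training set can drift from the full-data value, which matches the symptom described in the question.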

score 2 · Accepted Answer

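It looks like we both had a similar problem here. Unfortunately, imbalanced-learn isn't always what you need, and scikit doesn't provide the functionality you want either. You will need to implement your own code.

This is what I came up with for my application. Note that I haven't had much time to debug it, but I believe it works based on the testing I've done. Hope it helps: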

import numpy as np

def equal_sampler(classes, data, target, test_frac):
    # data: pandas DataFrame of features; target: pandas Series sharing data's index

    # Find the least frequent class and its fraction of the total
    _, count = np.unique(target, return_counts=True)
    fraction_of_total = min(count) / len(target)

    # Per-class train and test fractions, scaled to the least frequent class
    train_frac = (1 - test_frac) * fraction_of_total
    test_frac = test_frac * fraction_of_total

    # initialize index lists and find the per-class length of train and test
    train = []
    train_len = int(train_frac * data.shape[0])
    test = []
    test_len = int(test_frac * data.shape[0])

    # add values to train, drop them from the index and proceed to add to test
    for i in classes:
        indices = list(target[target == i].index)
        train_temp = np.random.choice(indices, train_len, replace=False)
        for val in train_temp:
            train.append(val)
            indices.remove(val)
        test_temp = np.random.choice(indices, test_len, replace=False)
        for val in test_temp:
            test.append(val)

    # X_train, y_train, X_test, y_test (label-based lookup on both objects)
    return data.loc[train], target.loc[train], data.loc[test], target.loc[test]

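As input, classes expects a list of the possible values, data expects the dataframe columns used for prediction, and target expects the target column.

Note that the algorithm may not be very efficient because of the triple for loop (list.remove takes linear time). Even so, it should be reasonably fast.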
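
If the linear-time list.remove calls become a bottleneck, one way around them is to shuffle each class's positions once and slice the result. The following is only a sketch of that idea, not the code from the answer above: the function name equal_sampler_indices and the seed parameter are hypothetical, it returns positional indices rather than pandas labels, and it rounds the per-class test size instead of truncating it.

```python
import numpy as np


def equal_sampler_indices(target, test_frac, seed=None):
    """Return (train_idx, test_idx) with the same number of samples per class.

    Every class contributes as many samples as the rarest class has,
    split between train and test according to test_frac.
    """
    rng = np.random.default_rng(seed)
    target = np.asarray(target)
    classes, counts = np.unique(target, return_counts=True)

    # Per-class sizes are dictated by the least frequent class.
    n_min = counts.min()
    test_len = int(round(test_frac * n_min))
    train_len = n_min - test_len

    train_idx, test_idx = [], []
    for cls in classes:
        # Shuffle this class's positions once; disjoint slices replace list.remove.
        positions = rng.permutation(np.flatnonzero(target == cls))
        train_idx.append(positions[:train_len])
        test_idx.append(positions[train_len:train_len + test_len])

    return np.concatenate(train_idx), np.concatenate(test_idx)


# Usage with hypothetical labels: 90 samples of class 0, 10 of class 1.
y = np.array([0] * 90 + [1] * 10)
train_idx, test_idx = equal_sampler_indices(y, test_frac=0.2, seed=0)
print(np.bincount(y[train_idx]), np.bincount(y[test_idx]))
```

With pandas inputs the returned positions would be used as data.iloc[train_idx] and target.iloc[train_idx], since they are positions and not index labels.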

score 1 · Accepted Answer

You can also look at StratifiedShuffleSplit, used as follows:

# We use a utility to generate artificial classification data.
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

X, y = make_classification(n_samples=100, n_informative=10, n_classes=2)
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
for train_index, test_index in sss.split(X, y):
    print("TRAIN:", train_index, "TEST:", test_index)
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Scale the features, then fit an SVM on this stratified split
    clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
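
The loop above stops at y_pred without scoring it. As a rough sketch, the same splitter can be handed to cross_val_score, which runs the fit and predict steps for every split and returns one score per split. The 90/10 class weights and the choice of balanced_accuracy as the metric are assumptions made here for imbalanced data; they do not come from the answer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical imbalanced data: roughly 90% of the samples fall in class 0.
X, y = make_classification(n_samples=200, n_informative=10, n_classes=2,
                           weights=[0.9, 0.1], random_state=0)

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
clf = make_pipeline(StandardScaler(), SVC(gamma='auto'))

# One balanced-accuracy score per stratified split.
scores = cross_val_score(clf, X, y, cv=sss, scoring='balanced_accuracy')
print("per-split scores:", scores)
print("mean score:", scores.mean())
```

Balanced accuracy averages the recall of each class, so a model that always predicts the majority class scores 0.5 instead of about 0.9.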
