
I'm new to data science and would like to clear up a few doubts here. I have an imbalanced dataset with 3 classes, labeled 1, 2, and 3. Class 2 is the majority (56.89%), class 1 accounts for 9.6%, and class 3 for 33.4%. May I know the correct procedure for handling an imbalanced dataset, with the goal of higher prediction accuracy in the end?

What I am doing right now is:

1) Split the dataset 70:30 (train/test).

2) Use SMOTE to balance it.

3) Try feature selection to find the most important features and re-transform them into a new training set for testing. But this step runs into an error.

My Jupyter notebook hits an error after step 3: MemoryError: could not allocate 14680064 bytes. May I know why as well? Thank you very much; any advice or help is appreciated!
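
For reference, here is a minimal sketch of what steps 1 and 2 look like in my notebook, assuming imbalanced-learn's SMOTE (X and y are placeholder names for my features and labels):

from sklearn.model_selection import train_test_split
from imblearn.over_sampling import SMOTE

# 1) 70:30 split, stratified so each split keeps the class proportions
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42)

# 2) Oversample the *training* set only; the test set stays untouched
X_train_res, y_train_res = SMOTE(random_state=42).fit_resample(X_train, y_train)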


2 Answers


Please don't use plain accuracy as your metric on a multi-class problem.

The right approach depends on what you actually want: is the minority class just as important to you as the majority class?

As for handling it: one thing you can do is balance the dataset at training time by shrinking the majority class's sample space down to roughly the size of the minority class; if that leaves too few data points, you could build a two-stage classifier instead. As for creating artificial data points (SMOTE), it sometimes works and sometimes doesn't, depending on the problem, so please describe yours. Compute and post the PRFS (precision, recall, F1-score, support) so we can better understand what you are actually trying to achieve.
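
A minimal sketch of computing PRFS with scikit-learn (y_test and pred_y are placeholder names for your true and predicted labels):

from sklearn.metrics import classification_report, precision_recall_fscore_support

# Per-class precision, recall, F1 and support -- far more informative than a
# single accuracy number on an imbalanced multi-class problem
precision, recall, f1, support = precision_recall_fscore_support(
    y_test, pred_y, labels=[1, 2, 3])

# Or, as a readable per-class summary:
print(classification_report(y_test, pred_y, labels=[1, 2, 3]))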

As for the memory error: some variable is asking for more memory than your system can supply. That is, the system reserves some headroom and you are exceeding it, or you are running into the most charming factor we face in data science, the "curse of dimensionality".
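
As a hedged sketch of one quick memory fix (the file name is hypothetical): downcasting 64-bit numeric columns to 32-bit often roughly halves a DataFrame's footprint, which can be enough to get past a MemoryError.

import numpy as np
import pandas as pd

df = pd.read_csv('your_data.csv')  # hypothetical file name

# Downcast 64-bit numeric columns to 32-bit to cut memory use roughly in half
for col in df.select_dtypes(include=['float64']).columns:
    df[col] = df[col].astype(np.float32)
for col in df.select_dtypes(include=['int64']).columns:
    df[col] = df[col].astype(np.int32)

print(df.memory_usage(deep=True).sum())  # total bytes after downcasting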

Answered 2020-04-07T12:39:32.907

Here is a generic example for you to consider.

import pandas as pd
import numpy as np

# Read dataset
df = pd.read_csv('balance-scale.data', 
                 names=['balance', 'var1', 'var2', 'var3', 'var4'])

# Display example observations
df.head()

df['balance'].value_counts()
# R    288
# L    288
# B     49
# Name: balance, dtype: int64

# Transform into binary classification
df['balance'] = [1 if b=='B' else 0 for b in df.balance]

df['balance'].value_counts()
# 0    576
# 1     49
# Name: balance, dtype: int64
# About 8% were balanced

from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier   # used further below
from sklearn.metrics import accuracy_score, roc_auc_score

# Next, we'll fit a very simple model using default settings for everything.
# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)

# Train model
clf_0 = LogisticRegression().fit(X, y)

# Predict on training set
pred_y_0 = clf_0.predict(X)

# How's the accuracy?
print( accuracy_score(y, pred_y_0) )
# 0.9216

# So our model has 92% overall accuracy, but is it because it's predicting only 1 class?
# Should we be excited?
print( np.unique( pred_y_0 ) )
# [0]

# At this point, we need to use RESAMPLING!
from sklearn.utils import resample
# Separate majority and minority classes
# upsample the minority class
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]

# Upsample minority class
df_minority_upsampled = resample(df_minority, 
                                 replace=True,     # sample with replacement
                                 n_samples=576,    # to match majority class
                                 random_state=123) # reproducible results

# Combine majority class with upsampled minority class
df_upsampled = pd.concat([df_majority, df_minority_upsampled])

# Display new class counts
df_upsampled.balance.value_counts()
# 1    576
# 0    576
# Name: balance, dtype: int64

# Separate input features (X) and target variable (y)
y = df_upsampled.balance
X = df_upsampled.drop('balance', axis=1)

# Train model
clf_1 = LogisticRegression().fit(X, y)

# Predict on training set
pred_y_1 = clf_1.predict(X)

# Is our model still predicting just one class?
print( np.unique( pred_y_1 ) )
# [0 1]

# How's our accuracy?
print( accuracy_score(y, pred_y_1) )
# 0.513888888889

# Great, now the model is no longer predicting just one class. While the
# accuracy also took a nosedive, it's now more meaningful as a performance metric.
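# (Aside: these accuracies are measured on the training set itself; for an
# honest estimate, score on a held-out stratified test set instead.)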

# now we need to downsample the majority class
# Separate majority and minority classes
df_majority = df[df.balance==0]
df_minority = df[df.balance==1]

# Downsample majority class
df_majority_downsampled = resample(df_majority, 
                                 replace=False,    # sample without replacement
                                 n_samples=49,     # to match minority class
                                 random_state=123) # reproducible results

# Combine minority class with downsampled majority class
df_downsampled = pd.concat([df_majority_downsampled, df_minority])

# Display new class counts
df_downsampled.balance.value_counts()
# 1    49
# 0    49
# Name: balance, dtype: int64

# Separate input features (X) and target variable (y)
y = df_downsampled.balance
X = df_downsampled.drop('balance', axis=1)

# Train model
clf_2 = LogisticRegression().fit(X, y)

# Predict on training set
pred_y_2 = clf_2.predict(X)

# Is our model still predicting just one class?
print( np.unique( pred_y_2 ) )
# [0 1]

# How's our accuracy?
print( accuracy_score(y, pred_y_2) )
# 0.581632653061

Always remember that the random forest algorithm handles imbalanced datasets fairly well, so maybe that's all you need! I usually start every experiment with a random forest; if it produces the results I want, I'm done, and there's no need to hunt for the single best algorithm in the universe. You can also easily automate testing dozens of algorithms on any given dataset.

# Separate input features (X) and target variable (y)
y = df.balance
X = df.drop('balance', axis=1)

# Train model
clf_4 = RandomForestClassifier()
clf_4.fit(X, y)

# Predict on training set
pred_y_4 = clf_4.predict(X)

# Is our model still predicting just one class?
print( np.unique( pred_y_4 ) )
# [0 1]

# How's our accuracy?
print( accuracy_score(y, pred_y_4) )
# 0.9744

# What about AUROC?
prob_y_4 = clf_4.predict_proba(X)
prob_y_4 = [p[1] for p in prob_y_4]
print( roc_auc_score(y, prob_y_4) )
# 0.999078798186
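
As a hedged alternative to resampling, many scikit-learn classifiers can re-weight classes instead; a minimal sketch on the same data:

# class_weight='balanced' weights errors inversely to class frequency,
# so no resampling of the data itself is needed
clf_5 = RandomForestClassifier(class_weight='balanced', random_state=123)
clf_5.fit(X, y)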

Reference:

https://elitedatascience.com/imbalanced-classes

Answered 2020-04-07T13:25:12.137