python - KNN can't find classes after balancing the data

Question

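I have a strange problem. I have a model with 4 clusters, and the data is imbalanced in the following proportions: 75%, 15%, 7% and 3%. I split it into train and test sets with an 80/20 ratio, then train a KNN with 5 neighbors, which gives me an accuracy of 1.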

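from sklearn import metrics
from sklearn.model_selection import StratifiedShuffleSplit
from sklearn.neighbors import KNeighborsClassifier

# Stratified 80/20 split; only the first of the 5 generated splits is used below
sss = StratifiedShuffleSplit(n_splits=5, test_size=0.2, random_state=42)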

train_index, test_index = next(sss.split(X, y))

x_train, y_train = X[train_index], y[train_index]
x_test, y_test = X[test_index], y[test_index]

KNN_final = KNeighborsClassifier()
KNN_final.fit(x_train, y_train)

y_pred = KNN_final.predict(x_test)

print('Avg. accuracy for all classes:', metrics.accuracy_score(y_test, y_pred))
print('Classification report: \n',metrics.classification_report(y_test, y_pred, digits=2))

Avg. accuracy for all classes: 1.0
Classification report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       140
           1       1.00      1.00      1.00        60
           2       1.00      1.00      1.00       300
           3       1.00      1.00      1.00      1500

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000

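Although this looks odd, I carried on: I obtained new data and tried to classify it with this model, but it never finds the class with the smallest share and always misclassifies it as the second-smallest class. So I tried balancing the data with the SMOTEENN algorithm from the imbalanced-learn library: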

Original dataset shape Counter({3: 7500, 2: 1500, 0: 700, 1: 300})

from collections import Counter
from imblearn.combine import SMOTEENN

sme = SMOTEENN(sampling_strategy='all', random_state=42)
X_res, y_res = sme.fit_resample(X, y)
print('Resampled dataset shape %s' % Counter(y_res))

Resampled dataset shape Counter({0: 7500, 1: 7500, 2: 7500, 3: 7500})

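Then I do the same thing: I split it into train and test sets with the same 80/20 ratio and train a new KNN classifier with 5 neighbors. But the classification report now looks even worse: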

Avg. accuracy for all classes: 1.0
Classification report: 
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      1500
           1       1.00      1.00      1.00       500

    accuracy                           1.00      2000
   macro avg       1.00      1.00      1.00      2000
weighted avg       1.00      1.00      1.00      2000

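I can't see what I'm doing wrong. After resampling the data, is there any step I need to perform besides splitting and shuffling before training the new classifier? Why does my KNN no longer see 4 classes?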

score 0 · Accepted Answer

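A full investigation would require your data, which you have not provided, but this behavior is (at least partly) consistent with the following scenario: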

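Your initial data contains duplicates (possibly many).
Because of these duplicates, some (most? all?) of your test data are not really new/unseen but copies of samples in the training data, which leads to an unreasonably high test accuracy of 1.0.
When new data (not duplicating your initial data) are added, the model unsurprisingly fails to meet the expectations raised by such a high accuracy (1.0) on the test data.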

Note that a stratified split does not protect you from this situation; here is a demonstration with toy data, adapted from the documentation:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
y = np.array([0, 1, 0, 1, 0, 1])

sss = StratifiedShuffleSplit(n_splits=5, test_size=0.5, random_state=0)
train_index, test_index = next(sss.split(X, y))

X[train_index]
# result:
array([[3, 4],
       [1, 2],
       [3, 4]])

X[test_index]
# result:
array([[3, 4],
       [1, 2],
       [1, 2]])
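
A possible remedy, sketched here under assumptions, is to drop exact duplicates before splitting so that no test row can be a copy of a training row. The data below are hypothetical (20 distinct rows, each repeated three times), and the sketch assumes that identical rows always carry the same label; it is not necessarily how the original data should be handled.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical data: 20 distinct rows, each repeated 3 times, 2 balanced classes.
base_X = np.arange(40).reshape(20, 2)
base_y = np.array([0, 1] * 10)
X = np.repeat(base_X, 3, axis=0)
y = np.repeat(base_y, 3)

# Keep only the first occurrence of every distinct row (and its label).
_, first_idx = np.unique(X, axis=0, return_index=True)
first_idx = np.sort(first_idx)
X_dedup, y_dedup = X[first_idx], y[first_idx]

# Stratified split on the deduplicated data.
sss = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_index, test_index = next(sss.split(X_dedup, y_dedup))

# Verify that no test row also appears in the training set.
train_rows = {tuple(row) for row in X_dedup[train_index]}
overlap = sum(tuple(row) in train_rows for row in X_dedup[test_index])

print("rows before/after deduplication:", X.shape[0], X_dedup.shape[0])
print("test rows also present in training:", overlap)
```

After deduplication, the test score is computed only on rows the classifier has not seen, so it should be a more honest estimate of performance on new data.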
