
I'm facing a problem for which I can't find any answer. I have a binary classification problem (output Y=0 or Y=1) where Y=1 is the minority class (Y=1 indicates default of a company, with a proportion of 0.02 in the original dataframe). I therefore applied oversampling with the SMOTE algorithm on my training set only (after splitting my dataframe into training and testing sets). I train a logistic regression on that training set (where the proportion of the "default" class is 0.3) and then look at the ROC curve and the MSE to test whether my algorithm predicts defaults well. I get very good results in terms of both AUC (AUC=0.89) and MSE (MSE=0.06). However, when I look more closely at the individual predictions, I find that 20% of the defaults are not predicted correctly.

Do you have a method to properly evaluate the quality of my predictions (where "quality" means predicting defaults well)? I thought AUC was a good criterion... Do you also have a method to improve my regression? Thanks
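For reference, a simplified sketch of my evaluation step (the package choice and variable names such as train, test and default are just illustrative; the SMOTE oversampling of the training set is done beforehand):

```r
library(pROC)

# logistic regression on the (already SMOTE-oversampled) training set
fit <- glm(default ~ ., data = train, family = binomial)

# predicted probabilities of default on the untouched test set
p <- predict(fit, newdata = test, type = "response")

# AUC looks good...
auc(roc(response = test$default, predictor = p))

# ...but at the usual 0.5 cutoff a fair share of the defaults are missed
pred_class <- ifelse(p > 0.5, 1, 0)
table(predicted = pred_class, actual = test$default)
```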


1 Answer


For every classification problem you can build a confusion matrix.

This is a two-way entry matrix that lets you see not only the correctly predicted true positives / true negatives (TP/TN), but also the false positives (FP) and false negatives (FN), which most of the time are what you are really interested in.

FP and FN are the errors your model makes; with sensitivity and specificity you can track how good your model is at detecting the positives (sensitivity = TP / (TP + FN)) and the negatives (specificity = TN / (TN + FP)).
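As a minimal sketch with made-up counts (positive class = default), both metrics come straight from the four cells:

```r
# hypothetical confusion-matrix counts (positive class = default)
TP <- 80    # defaults correctly caught
FN <- 20    # defaults missed
TN <- 900   # non-defaults correctly cleared
FP <- 100   # non-defaults wrongly flagged

sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate

sensitivity   # 0.8 -> 20% of defaults are missed, as in your question
specificity   # 0.9
```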

Note that you cannot improve one without degrading the other, so sometimes you have to choose which one matters more.

A good compromise is the F1-score, the harmonic mean of precision and recall (sensitivity), which balances the two kinds of error.
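A minimal sketch of the F1-score, reusing the same kind of made-up counts (F1 combines precision and recall, so it only needs TP, FP and FN):

```r
# hypothetical counts again (positive class = default)
TP <- 80; FP <- 100; FN <- 20

precision <- TP / (TP + FP)    # of the predicted defaults, how many really default
recall    <- TP / (TP + FN)    # identical to sensitivity
F1        <- 2 * precision * recall / (precision + recall)
F1
```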

So if you are more interested in the defaults (let's say defaults = positive class), you will prefer a model with higher sensitivity. But remember not to completely neglect specificity either.

Here is some example code in R:

# confusionMatrix(data = predictions, reference = true labels) returns
# the confusion matrix plus accuracy, sensitivity, specificity, etc.
# (a random permutation of iris$Species stands in for "predictions" here)
caret::confusionMatrix(data = sample(iris$Species), reference = iris$Species)
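Applied to your default problem, a sketch might look like this (p and test$default are hypothetical names for the predicted default probabilities and the true 0/1 labels; the 0.3 cutoff is just an illustration of trading some specificity for more sensitivity):

```r
library(caret)

# hypothetical inputs: p = predicted default probabilities, test$default = true labels
pred  <- factor(ifelse(p > 0.3, 1, 0), levels = c(0, 1))  # lower cutoff -> higher sensitivity
truth <- factor(test$default, levels = c(0, 1))

# positive = "1" tells caret that "default" is the class of interest;
# mode = "everything" also reports precision, recall and F1
confusionMatrix(data = pred, reference = truth, positive = "1", mode = "everything")
```

Sweeping the cutoff and watching sensitivity, specificity and F1 is usually more informative for an imbalanced problem like yours than AUC or MSE alone.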