
I'm facing a problem for which I can't find any answer. I have a binary classification problem (output Y=0 or Y=1) where Y=1 is the minority class (Y=1 indicates default of a company, with a proportion of 0.02 in the original dataframe). I therefore applied oversampling with the SMOTE algorithm on my training set only (after splitting my dataframe into training and testing sets). I train a logistic regression on that training set (where the proportion of the "default" class is 0.3) and then look at the ROC curve and the MSE to test whether my algorithm predicts defaults well. I get very good results in terms of both AUC (AUC=0.89) and MSE (MSE=0.06). However, when I look more closely at the individual predictions, I find that 20% of the defaults are not predicted correctly.

Do you have a method to properly evaluate the quality of my predictions (where "quality" means predicting defaults well)? I thought AUC was a good criterion... Do you also have a method to improve my regression? Thanks
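For reference, a simplified sketch of my evaluation step (the package choice and variable names such as train, test and default are just illustrative; the SMOTE oversampling of the training set is done beforehand):

```r
library(pROC)

# logistic regression on the (already SMOTE-oversampled) training set
fit <- glm(default ~ ., data = train, family = binomial)

# predicted probabilities of default on the untouched test set
p <- predict(fit, newdata = test, type = "response")

# AUC looks good...
auc(roc(response = test$default, predictor = p))

# ...but at the usual 0.5 cutoff a fair share of the defaults are missed
pred_class <- ifelse(p > 0.5, 1, 0)
table(predicted = pred_class, actual = test$default)
```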


1 Answer


For every classification problem you can build a confusion matrix.

This is a two-way entry matrix that lets you see not only the correctly predicted true positives / true negatives (TP/TN), but also the false positives (FP) and false negatives (FN), which most of the time are what you are really interested in.

FP and FN are the errors your model makes; with sensitivity and specificity you can track how good your model is at detecting the positives (sensitivity = TP / (TP + FN)) and the negatives (specificity = TN / (TN + FP)).
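As a minimal sketch with made-up counts (positive class = default), both metrics come straight from the four cells:

```r
# hypothetical confusion-matrix counts (positive class = default)
TP <- 80    # defaults correctly caught
FN <- 20    # defaults missed
TN <- 900   # non-defaults correctly cleared
FP <- 100   # non-defaults wrongly flagged

sensitivity <- TP / (TP + FN)   # true positive rate
specificity <- TN / (TN + FP)   # true negative rate

sensitivity   # 0.8 -> 20% of defaults are missed, as in your question
specificity   # 0.9
```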

Note that you cannot improve one without degrading the other, so sometimes you have to choose which one matters more.

A good compromise is the F1-score, the harmonic mean of precision and recall (sensitivity), which balances the two kinds of error.
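A minimal sketch of the F1-score, reusing the same kind of made-up counts (F1 combines precision and recall, so it only needs TP, FP and FN):

```r
# hypothetical counts again (positive class = default)
TP <- 80; FP <- 100; FN <- 20

precision <- TP / (TP + FP)    # of the predicted defaults, how many really default
recall    <- TP / (TP + FN)    # identical to sensitivity
F1        <- 2 * precision * recall / (precision + recall)
F1
```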

So if you are more interested in the defaults (let's say defaults = positive class), you will prefer a model with higher sensitivity. But remember not to completely neglect specificity either.

Here is some example code in R:

# confusionMatrix(data = predictions, reference = true labels) returns
# the confusion matrix plus accuracy, sensitivity, specificity, etc.
# (a random permutation of iris$Species stands in for "predictions" here)
caret::confusionMatrix(data = sample(iris$Species), reference = iris$Species)
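Applied to your default problem, a sketch might look like this (p and test$default are hypothetical names for the predicted default probabilities and the true 0/1 labels; the 0.3 cutoff is just an illustration of trading some specificity for more sensitivity):

```r
library(caret)

# hypothetical inputs: p = predicted default probabilities, test$default = true labels
pred  <- factor(ifelse(p > 0.3, 1, 0), levels = c(0, 1))  # lower cutoff -> higher sensitivity
truth <- factor(test$default, levels = c(0, 1))

# positive = "1" tells caret that "default" is the class of interest;
# mode = "everything" also reports precision, recall and F1
confusionMatrix(data = pred, reference = truth, positive = "1", mode = "everything")
```

Sweeping the cutoff and watching sensitivity, specificity and F1 is usually more informative for an imbalanced problem like yours than AUC or MSE alone.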