The following link gave me a better understanding of how to incorporate misclassification costs into my binary classification model: https://mlr.mlr-org.com/articles/tutorial/cost_sensitive_classif.html
With a standard classifier, the default threshold is usually 0.5, and the aim is to minimize the total number of misclassification errors (i.e. obtain the maximum accuracy). However, all misclassification errors are then treated equally. This is typically not the case in a real-world setting, since the cost of a false negative may be much greater than that of a false positive.
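To make the cost asymmetry concrete with the cost matrix I use in the code below (misclassifying a bad credit risk as good costs 5, while the reverse mistake costs 1): labelling an applicant as bad is already the cheaper decision whenever 5*P(bad) > 1*(1 - P(bad)), i.e. whenever P(bad) > 1/6 ≈ 0.17, rather than only when P(bad) > 0.5.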
Using empirical thresholding, I was able to obtain the threshold for classifying an instance as good or bad that minimizes the average credit cost. However, this comes at the price of reduced accuracy and worse values of other performance measures. This is illustrated in the following figure:
In the figure above, the red line denotes the standard threshold of 0.5 which maximizes accuracy but gives a sub-optimal average credit cost. The blue line denotes the desired threshold that minimizes the cost, but now the accuracy is drastically reduced.
Generally, I would not be concerned about the reduced accuracy. Suppose, however, that there is also an incentive not only to minimize the cost but also to maximize the precision, where precision is the positive predictive value, ppv = TP/(TP+FP). Then the green line might be a good trade-off that gives a relatively low cost and a relatively high ppv. Here, I plotted the green line as the average of the red and blue lines (both the credit cost and ppv curves seem to have about the same gradient in this region, so calculating the threshold this way probably gives a good estimate), but is there a way to calculate this threshold exactly?
My thoughts are to create a new performance measure as a function of both the costs and the ppv, and then minimize this performance measure. Example: measure = credit.costs*(-ppv)
But I'm not sure how to code this in R (a rough, untested sketch of what I mean is included after my code below). Any advice on what should be done would be greatly appreciated.
My R code is as follows:
library(mlr)
library(ggplot2)  # needed later for the geom_vline annotations on the threshold plot
## Load dataset
data(GermanCredit, package = "caret")
credit.task = makeClassifTask(data = GermanCredit, target = "Class")
## Remove 2 constant columns: Purpose.Vacation, Personal.Female.Single
credit.task = removeConstantFeatures(credit.task)
## Generate cost matrix (rows = true classes, columns = predicted classes):
## misclassifying a bad credit risk as good costs 5, the reverse mistake costs 1
costs = matrix(c(0, 1, 5, 0), 2)
colnames(costs) = rownames(costs) = getTaskClassLevels(credit.task)
## Make cost measure
credit.costs = makeCostMeasure(id = "credit.costs", name = "Credit costs", costs = costs, best = 0, worst = 5)
## Set training scheme with repeated 10-fold cross-validation
set.seed(100)
rin = makeResampleInstance("RepCV", folds = 10, reps = 3, task = credit.task)
## Define a logistic regression learner (nnet::multinom()) that predicts probabilities
lrn = makeLearner("classif.multinom", predict.type = "prob", trace = FALSE)
r = resample(lrn, credit.task, resampling = rin, measures = list(credit.costs, mmce), show.info = FALSE)
r  # aggregated credit costs and mmce over the resampling iterations
# Tune the threshold to minimize the average credit cost over the cross-validated predicted probabilities
cost_tune.res = tuneThreshold(pred = r$pred, measure = credit.costs)
# Tune the threshold to maximize the precision (ppv) over the same predictions
ppv_tune.res = tuneThreshold(pred = r$pred, measure = ppv)
## Threshold vs. performance curves for credit cost, ppv and accuracy
d = generateThreshVsPerfData(r, measures = list(credit.costs, ppv, acc))
plt = plotThreshVsPerf(d)
## Mark the default (red), cost-optimal (blue) and averaged (green) thresholds on the plot
plt + geom_vline(xintercept = cost_tune.res$th, colour = "blue") + geom_vline(xintercept = 0.5, colour = "red") +
  geom_vline(xintercept = 1/2 * (cost_tune.res$th + 0.5), colour = "green")
## Confusion matrix and performance on the pooled cross-validated predictions
calculateConfusionMatrix(r$pred)
performance(r$pred, measures = list(acc, ppv, credit.costs))
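Coming back to my idea of combining the credit costs and the ppv into a single measure: the following is a rough, untested sketch of the direction I had in mind, using makeMeasure. The name cost.ppv is just something I made up, and the product credit.costs*(-ppv) is only my first guess at a combination, so I'm not sure this is the right approach at all.
## Rough sketch of a custom measure combining the credit costs and the ppv (untested)
cost.ppv = makeMeasure(
  id = "cost.ppv", minimize = TRUE,
  properties = c("classif", "req.pred", "req.truth"),
  fun = function(task, model, pred, feats, extra.args) {
    cost = performance(pred, measures = credit.costs)  # average credit cost
    prec = performance(pred, measures = ppv)           # precision (ppv)
    ## My proposed combination; I am not sure it is a sensible trade-off criterion
    unname(cost * (-prec))
  }
)
## Tune the threshold on the combined measure instead of on the cost alone
combo_tune.res = tuneThreshold(pred = r$pred, measure = cost.ppv)
combo_tune.res$th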
Finally, I'm also a bit confused about my ppv value. From my confusion matrix, I calculate the ppv as 442/(442+289) = 0.6046512, but the reported value is slightly different (0.6053531). Is there something wrong with my calculation?
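Concretely, this is the comparison I am making (442 and 289 are the true positive and false positive counts I read off the confusion matrix above):
## Manual ppv from the pooled confusion matrix counts vs. the reported value
442 / (442 + 289)                    # 0.6046512
performance(r$pred, measures = ppv)  # 0.6053531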