EDIT: As Dwin pointed out in the comments, the code below is not for an ROC curve. An ROC curve must be indexed in variation in t
and not in lambda
(as I do below). I will edit the code below when I get the chance.
Below is my attempt to create an ROC curve of glmnet predicting a binary outcome. I've simulated a matrix that approximates glmnet results in the code below. As some of you know, given an n x p matrix of inputs, glmnet outputs an n x 100 matrix of predicted probabilities [$\Pr(y_i = 1)$] for 100 different values of lambda. The output will be narrower than 100 if further changes in lambda stop increasing predictive power. The simulated matrix of glmnet predicted probabilities below is a 250x69 matrix.
First, is there an easier way to plot a glmnet ROC curve? Second, if not, does the below approach seem correct? Third, do I care about plotting (1) the probability of false/true positives OR (2) simply the observed rate of false/true positives?
set.seed(06511)
# Simulate predictions matrix
phat = as.matrix(rnorm(250,mean=0.35, sd = 0.12))
lambda_effect = as.matrix(seq(from = 1.01, to = 1.35, by = 0.005))
phat = phat %*% t(lambda_effect)
#Choose a cut-point
t = 0.5
#Define a predictions matrix
predictions = ifelse(phat >= t, 1, 0)
##Simulate y matrix
y_phat = apply(phat, 1, mean) + rnorm(250,0.05,0.10)
y_obs = ifelse(y_phat >= 0.55, 1, 0)
#percentage of 1 observations in the validation set,
p = length(which(y_obs==1))/length(y_obs)
# dim(testframe2_e2)
#probability of the model predicting 1 while the true value of the observation is 0,
apply(predictions, 1, sum)
## Count false positives for each model
## False pos ==1, correct == 0, false neg == -1
error_mat = predictions - y_obs
## Define a matrix that isolates false positives
error_mat_fp = ifelse(error_mat ==1, 1, 0)
false_pos_rate = apply(error_mat_fp, 2, sum)/length(y_obs)
# Count true positives for each model
## True pos == 2, mistakes == 1, true neg == 0
error_mat2 = predictions + y_obs
## Isolate true positives
error_mat_tp = ifelse(error_mat2 ==2, 1, 0)
true_pos_rate = apply(error_mat_tp, 2, sum)/length(y_obs)
## Do I care about (1) this probability OR (2) simply the observed rate?
## (1)
#probability of false-positive,
p_fp = false_pos_rate/(1-p)
#probability of true-positive,
p_tp = true_pos_rate/p
#plot the ROC,
plot(p_fp, p_tp)
## (2)
plot(false_pos_rate, true_pos_rate)
There's one question on this on SO, but the answer was rough and not quite right: glmnet lasso ROC charts