I am working on a data project assignment where I am asked to use 50% of the data for training and the remaining 50% for testing. I would like to take advantage of cross-validation and still meet that requirement.
Currently, my code is the following:
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

clf = LogisticRegression(penalty='l2', class_weight='balanced')
tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)
# cross validation
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
    # Compute ROC curve and area under the curve for this fold
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))  # np.interp replaces the deprecated scipy interp
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)
print("Average AUC:", sum(aucs) / len(aucs), "AUC of last fold:", aucs[-1])
Since I am using just 2 splits, is that equivalent to a 50:50 train-test split? Or should I first split the data 50:50, run cross-validation on the training half only, and finally evaluate that model on the remaining 50% as held-out test data (a rough sketch of this second option is below)?
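For clarity, here is a minimal sketch of what I mean by that second option, assuming the same X and y as above and using scikit-learn's train_test_split and cross_val_score (the variable names and cv=5 are just placeholders, not part of my current code):

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import roc_auc_score

# Hold out 50% of the data as a final test set
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

clf = LogisticRegression(penalty='l2', class_weight='balanced')

# Cross-validate on the training half only
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring='roc_auc')
print("CV AUC on training half:", cv_scores.mean())

# Fit on the full training half, then score once on the held-out test half
clf.fit(X_train, y_train)
probas_ = clf.predict_proba(X_test)[:, 1]
print("Test AUC:", roc_auc_score(y_test, probas_))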