
I am working on a data project assignment where I am asked to use 50% of the data for training and the remaining 50% for testing. I would like to use the magic of cross-validation and still meet this requirement.

Currently, my code is the following:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import roc_curve, auc

clf = LogisticRegression(penalty='l2', class_weight='balanced')

tprs = []
aucs = []
mean_fpr = np.linspace(0, 1, 100)

# cross-validation
cv = StratifiedKFold(n_splits=2)
for train, test in cv.split(X, y):
    probas_ = clf.fit(X[train], y[train]).predict_proba(X[test])
    # compute the ROC curve and the area under it
    fpr, tpr, thresholds = roc_curve(y[test], probas_[:, 1])
    tprs.append(np.interp(mean_fpr, fpr, tpr))
    tprs[-1][0] = 0.0
    roc_auc = auc(fpr, tpr)
    aucs.append(roc_auc)

print("Average AUC:", sum(aucs) / len(aucs), "Last fold AUC:", aucs[-1])

Since I am using just 2 splits, is this equivalent to using a 50:50 train-test split? Or should I first split the data 50:50, then use cross-validation on the training half, and finally use that model to predict on the remaining 50% of test data?


1 Answer


You should implement your second suggestion. Cross-validation is meant for tuning the parameters of your approach; in your example, such parameters include the value of C and the class_weight='balanced' setting of LogisticRegression. So you should:

  • Split the data into a 50% training set and a 50% test set
  • Use the training data to select the optimal parameter values for your model via cross-validation
  • Refit the model with the optimal parameters on the full training data
  • Predict on the test data and report the score of the evaluation measure you selected

Note that you should use the test data only for reporting the final score, never for tuning the model; otherwise you are cheating. Imagine that in a real setting you would not have access to the test data until the last moment, so you could not use them for tuning.
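
Here is a minimal sketch of that workflow, assuming X and y are the feature matrix and label vector from your question; the candidate C values and the 5-fold inner cross-validation are illustrative choices, not prescriptions:

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, GridSearchCV, StratifiedKFold
from sklearn.metrics import roc_auc_score

# 50% train / 50% test; stratify so both halves keep the class ratio
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, stratify=y, random_state=0)

# tune C with cross-validation on the training half only
param_grid = {'C': [0.01, 0.1, 1, 10, 100]}  # illustrative grid, adjust as needed
grid = GridSearchCV(
    LogisticRegression(penalty='l2', class_weight='balanced'),
    param_grid, scoring='roc_auc', cv=StratifiedKFold(n_splits=5))
grid.fit(X_train, y_train)

# with refit=True (the default), GridSearchCV retrains the best model
# on the full training half, which covers the "refit" step above
best_clf = grid.best_estimator_

# touch the test half only once, to report the final score
probas = best_clf.predict_proba(X_test)[:, 1]
print("Best C:", grid.best_params_['C'])
print("Test AUC:", roc_auc_score(y_test, probas))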

Answered 2018-01-22T08:25:13.120