I used GridSearchCV to tune the parameters of KNeighborsClassifier on the training data, but surprisingly it returned worse results on the test set than the default parameters. Why does this happen? Any insight into the proper use of GridSearchCV would be appreciated; I need to do this for several algorithms to compare default results with hyper-tuned results.

GridSearchCV code:
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

# Parameters we want to try
param_grid = {'n_neighbors': [1, 2, 3, 5, 7],
              'weights': ['uniform', 'distance'],
              'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],
              'leaf_size': [20, 30, 40],
              'p': [1, 2, 3],
              'metric': ['minkowski', 'chebyshev', 'manhattan', 'euclidean']}

# Define the grid search we want to run. Run it with four CPUs in parallel.
gs_cv = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=4)

# Run the grid search (should only be on training data!)
gs_cv.fit(train_X, train_y)

# Print the best parameters
print(gs_cv.best_params_)
# {'algorithm': 'auto', 'leaf_size': 20, 'metric': 'minkowski', 'n_neighbors': 7, 'p': 1, 'weights': 'uniform'}
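(As an aside, the fitted search object already holds the winning model, so the best parameters don't have to be copied by hand. A minimal sketch on a synthetic dataset — the dataset and the reduced grid here are illustrative assumptions, not my actual data:)

```python
# Sketch: GridSearchCV refits the best model on the full training set,
# exposed as best_estimator_, so it can be evaluated directly.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=300, n_features=8, random_state=0)
train_X, test_X, train_y, test_y = train_test_split(X, y, random_state=0)

param_grid = {'n_neighbors': [1, 3, 5, 7],
              'weights': ['uniform', 'distance']}
gs_cv = GridSearchCV(KNeighborsClassifier(), param_grid, n_jobs=4)
gs_cv.fit(train_X, train_y)

# best_estimator_ is already refit with the best parameters found
best_knn = gs_cv.best_estimator_
test_acc = best_knn.score(test_X, test_y)
print(gs_cv.best_params_, test_acc)
```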
Results using these parameters:
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, log_loss

knn = KNeighborsClassifier(n_neighbors=7,
                           weights='uniform',
                           algorithm='auto',
                           leaf_size=20,
                           p=1,
                           metric='minkowski')
knn.fit(train_X, train_y)
print("="*30)
print('****Results****')
test_predictions = knn.predict(test_X)
acc = accuracy_score(test_y, test_predictions)
print("Accuracy: {:.2%}".format(acc))
test_probabilities = knn.predict_proba(test_X)
ll = log_loss(test_y, test_probabilities, labels=np.unique(train_y))
print("Log Loss: {:.4}".format(ll))
log_entry = pd.DataFrame([[name, acc*100, ll]], columns=log_cols)
log = log.append(log_entry)
==============================
****Results****
Accuracy: 87.50%
Log Loss: 0.3354
Using the default KNN parameters:
knn = KNeighborsClassifier()
knn.fit(train_X, train_y)
print("="*30)
print('****Results****')
test_predictions = knn.predict(test_X)
acc = accuracy_score(test_y, test_predictions)
print("Accuracy: {:.2%}".format(acc))
test_probabilities = knn.predict_proba(test_X)
ll = log_loss(test_y, test_probabilities, labels=np.unique(train_y))
print("Log Loss: {:.4}".format(ll))
log_entry = pd.DataFrame([[name, acc*100, ll]], columns=log_cols)
log = log.append(log_entry)
==============================
****Results****
Accuracy: 91.67%
Log Loss: 0.2398