python - Why does GridSearchCV behave differently with the same code

Question

I am calling GridSearchCV to get the best estimator. If I pass the parameters like this:

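# Imports assumed; the original post omitted them.
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# X_train, X_test, y_train, y_test come from the asker's own split (not shown).
clf = DecisionTreeClassifier(random_state=42)

parameters = {'max_depth': [2,3,4,5,6,7,8,9,10],
              'min_samples_leaf': [2,3,4,5,6,7,8,9,10],
              'min_samples_split': [2,3,4,5,6,7,8,9,10]}

scorer = make_scorer(f1_score)

grid_obj = GridSearchCV(clf, parameters, scoring=scorer)

grid_fit = grid_obj.fit(X_train, y_train)

best_clf = grid_fit.best_estimator_

# Redundant with the default refit=True, but kept as in the original post.
best_clf.fit(X_train, y_train)

best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# f1_score expects (y_true, y_pred).
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))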

The result is:

The training F1 Score is 0.784810126582
The testing F1 Score is 0.72

With the same data, the result is different when the only change is replacing [2,3,4,5,6,7,8,9,10] with [2,4,6,8,10]:

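# Imports assumed; the original post omitted them.
from sklearn.metrics import f1_score, make_scorer
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

clf = DecisionTreeClassifier(random_state=42)

parameters = {'max_depth': [2,4,6,8,10],
              'min_samples_leaf': [2,4,6,8,10],
              'min_samples_split': [2,4,6,8,10]}

scorer = make_scorer(f1_score)

grid_obj = GridSearchCV(clf, parameters, scoring=scorer)
grid_fit = grid_obj.fit(X_train, y_train)
best_clf = grid_fit.best_estimator_
# Redundant with the default refit=True, but kept as in the original post.
best_clf.fit(X_train, y_train)
best_train_predictions = best_clf.predict(X_train)
best_test_predictions = best_clf.predict(X_test)

# f1_score expects (y_true, y_pred).
print('The training F1 Score is', f1_score(y_train, best_train_predictions))
print('The testing F1 Score is', f1_score(y_test, best_test_predictions))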

Result:

The training F1 Score is 0.814814814815
The testing F1 Score is 0.8

I am confused about how GridSearchCV works.

score 0 · Accepted Answer

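By changing the values the grid search explores, you evaluate and compare your model over a different set of hyperparameter combinations. Remember that what GridSearchCV ultimately does is pick the best combination from the grid you give it.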

So in your two snippets, grid_fit.best_estimator_ may well be a different model, which naturally explains why they produce different scores on the training and test sets.
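To make the selection step concrete, here is a minimal sketch of what the search amounts to: score every combination with cross-validation and keep the one with the best mean score. The synthetic dataset, the small grid and the 5-fold splitter are assumptions chosen for illustration; they are not the asker's data or settings.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import (GridSearchCV, ParameterGrid,
                                     StratifiedKFold, cross_val_score)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's data (an assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)

# A deliberately small, hypothetical grid so the loop stays quick.
param_grid = {"max_depth": [2, 4, 6], "min_samples_leaf": [2, 4, 6]}
cv = StratifiedKFold(n_splits=5)  # fixed, unshuffled folds

# Manual search: score every combination, remember the best mean F1.
manual_best_score, manual_best_params = -1.0, None
for params in ParameterGrid(param_grid):
    model = DecisionTreeClassifier(random_state=42, **params)
    mean_f1 = cross_val_score(model, X, y, scoring="f1", cv=cv).mean()
    if mean_f1 > manual_best_score:
        manual_best_score, manual_best_params = mean_f1, params

# The same search delegated to GridSearchCV, on the same folds.
search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                      param_grid, scoring="f1", cv=cv).fit(X, y)

print("manual      :", manual_best_params, manual_best_score)
print("GridSearchCV:", search.best_params_, search.best_score_)
```

Because the winner is chosen only among the combinations listed, changing the lists changes the set of candidates and can change the winner.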

In the first case you might end up with

clf = DecisionTreeClassifier(max_depth = 3, min_samples_leaf = 5, min_samples_split = 9)

and in the second case

clf = DecisionTreeClassifier(max_depth = 2, min_samples_leaf = 4, min_samples_split = 8)

(To check, inspect grid_fit.best_params_ in each case.)
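The check suggested above could look roughly like this. It is only a sketch: the data is synthetic (an assumption), and the grid is the second one from the question. It prints the selected parameters, the cross-validated score they were selected on, and confirms that best_estimator_ carries those parameters and is already fitted.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's data (an assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

values = [2, 4, 6, 8, 10]
parameters = {"max_depth": values,
              "min_samples_leaf": values,
              "min_samples_split": values}

grid_fit = GridSearchCV(DecisionTreeClassifier(random_state=42),
                        parameters, scoring="f1").fit(X_train, y_train)

# Which combination won, and with what mean cross-validated F1.
print("best_params_:", grid_fit.best_params_)
print("best_score_ :", grid_fit.best_score_)

# best_estimator_ is a tree configured with exactly those parameters.
best_clf = grid_fit.best_estimator_
selected = {name: best_clf.get_params()[name] for name in parameters}
print("on best_estimator_:", selected)

# With the default refit=True it is already fitted on all of X_train,
# so it can predict without another call to fit.
print("first test predictions:", best_clf.predict(X_test)[:5])
```

Running this once per grid and comparing the two best_params_ dictionaries shows directly whether the two searches selected different trees.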

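That said, you would expect the first search to do at least as well, at least on the cross-validated score used for selection, because the second grid search uses a subset of the first one's parameter values. As @Attack68 mentioned above, the discrepancy may be because randomness is not controlled at every step.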
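One way to take randomness out of the comparison is to give both searches the same seeded splitter, so they score candidates on identical folds. The sketch below does that on synthetic data (an assumption), with the grid trimmed to two hyperparameters and 3 folds to keep it quick. Under those conditions the full grid's best cross-validated score cannot be lower than the subset's, while the test score carries no such guarantee.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import f1_score
from sklearn.model_selection import (GridSearchCV, StratifiedKFold,
                                     train_test_split)
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the asker's data (an assumption, not the real data).
X, y = make_classification(n_samples=300, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y)

# One seeded splitter shared by both searches: identical folds every time.
cv = StratifiedKFold(n_splits=3, shuffle=True, random_state=0)


def run_search(values):
    """Grid-search a seeded decision tree over the given values."""
    grid = {"max_depth": values, "min_samples_leaf": values}
    search = GridSearchCV(DecisionTreeClassifier(random_state=42),
                          grid, scoring="f1", cv=cv)
    return search.fit(X_train, y_train)


full = run_search([2, 3, 4, 5, 6, 7, 8, 9, 10])
subset = run_search([2, 4, 6, 8, 10])

for name, search in [("full grid", full), ("subset grid", subset)]:
    # search.predict uses the refitted best_estimator_.
    test_f1 = f1_score(y_test, search.predict(X_test))
    print(name, search.best_params_,
          "cv F1 =", round(search.best_score_, 4),
          "test F1 =", round(test_f1, 4))
```

The comparison that is guaranteed is between the two best_score_ values. The scores printed in the question are training and test F1, which are measured on different data from the cross-validation folds and can go either way.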
