I am using Python with scikit-learn to run some cross-validation tests. Currently I split a pandas dataframe into a training set (X_train, y_train) and a test set (X_test, y_test), run a randomized 3-fold cross-validated grid search on the training set, and then fit a final model with the best parameters from the grid search and predict on my test set:
from sklearn import cross_validation, ensemble, grid_search
X_train, X_test, y_train, y_test = cross_validation.train_test_split(
X_data, y_data, train_size=.5, random_state=1
)
N_cv = y_train.shape[0]
kf = cross_validation.KFold(n=N_cv, n_folds=3, shuffle=True, random_state=None)
#Generate Gradient Boosting Regression
#Set up grid search parameters
gb_learning_grid = [2.0 ** i for i in range(-3, 2)]  # 0.125, 0.25, 0.5, 1.0, 2.0
gb_estimators_grid = [100, 200, 300]
gb_minleaf_grid = [25, 50, 75]
gradientboost_grid = ensemble.GradientBoostingRegressor()
gradientboost_param = {'learning_rate':gb_learning_grid, 'n_estimators':gb_estimators_grid, 'min_samples_leaf':gb_minleaf_grid}
#Stage 1 Grid Search
stage1_gb_model = grid_search.GridSearchCV(estimator=gradientboost_grid, param_grid=gradientboost_param, n_jobs=-1, cv=kf)
gradientboost_CV1 = stage1_gb_model.fit(X=X_train, y=y_train)
best_estimators_gb = gradientboost_CV1.best_params_['n_estimators']
best_learning_gb = gradientboost_CV1.best_params_['learning_rate']
best_minleaf_gb = gradientboost_CV1.best_params_['min_samples_leaf']
#Stage 2 Grid Search
gradientboost_grid = ensemble.GradientBoostingRegressor(
min_samples_leaf=best_minleaf_gb, n_estimators=best_estimators_gb)
#Finer learning-rate grid of 8 values centred on the stage 1 winner
stage2_learning_gb = [best_learning_gb - 0.025 + 0.00625 * i for i in range(8)]
stage2_gb_param = {'learning_rate':stage2_learning_gb}
stage2_gb_model = grid_search.GridSearchCV(estimator=gradientboost_grid, param_grid=stage2_gb_param, n_jobs=-1, cv=kf)
gradientboost_CV2 = stage2_gb_model.fit(X=X_train, y=y_train)
best_learning_gb = gradientboost_CV2.best_params_['learning_rate']
#Generate Primary Model
final_gbr = ensemble.GradientBoostingRegressor(n_estimators=best_estimators_gb, learning_rate=best_learning_gb, min_samples_leaf=best_minleaf_gb)
final_fit = final_gbr.fit(X_train, y_train)
final_predict = final_fit.predict(X_test)
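For context, here is how I then score the held-out predictions. This is a self-contained sketch with synthetic data standing in for my dataframe; it uses the `sklearn.model_selection` module (in releases >= 0.18 the old `cross_validation`/`grid_search` modules were merged into it), and the fixed hyperparameters are just placeholders for the grid-search winners:

```python
# Sketch: fit a final GradientBoostingRegressor and score it on the test set.
# make_regression stands in for my real pandas data; parameter values are
# illustrative placeholders, not the actual grid-search results.
from sklearn import ensemble
from sklearn.datasets import make_regression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X_data, y_data = make_regression(n_samples=400, n_features=5, noise=10.0,
                                 random_state=1)
X_train, X_test, y_train, y_test = train_test_split(
    X_data, y_data, train_size=.5, random_state=1)

final_gbr = ensemble.GradientBoostingRegressor(n_estimators=100,
                                               learning_rate=0.1,
                                               min_samples_leaf=25,
                                               random_state=1)
final_fit = final_gbr.fit(X_train, y_train)
final_predict = final_fit.predict(X_test)

# Score the predictions on the held-out test set.
mse = mean_squared_error(y_test, final_predict)
r2 = r2_score(y_test, final_predict)
print("test MSE: %.3f, test R^2: %.3f" % (mse, r2))
```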
So being able to run these randomized k-fold grid searches is nice, but is there a native way in the sklearn library to run the grid search against one specific dataset? To put it more precisely in terms of my code above: is there a native sklearn way to develop models on X_train, y_train over the given parameter grid, where the best parameters are determined by how each resulting model scores on a specific dataset, rather than on random k folds of X_train, y_train?
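The closest thing I have found so far is PredefinedSplit, which lets GridSearchCV score every parameter combination against one fixed validation set instead of random folds; I am not certain this is the intended idiom, so this is just a sketch with synthetic data, using the newer `sklearn.model_selection` module:

```python
# Sketch: use PredefinedSplit so GridSearchCV evaluates each parameter
# combination on a single fixed validation set rather than random k folds.
# In test_fold, -1 marks rows that are always in training; 0 marks rows
# that form the single predefined validation fold. Data is synthetic.
import numpy as np
from sklearn import ensemble
from sklearn.datasets import make_regression
from sklearn.model_selection import GridSearchCV, PredefinedSplit

X_data, y_data = make_regression(n_samples=200, n_features=5, noise=10.0,
                                 random_state=1)

# First 100 rows: training only (-1); last 100 rows: the validation fold (0).
test_fold = np.array([-1] * 100 + [0] * 100)
ps = PredefinedSplit(test_fold)

param_grid = {'learning_rate': [0.05, 0.1, 0.2],
              'min_samples_leaf': [25, 50]}
search = GridSearchCV(ensemble.GradientBoostingRegressor(random_state=1),
                      param_grid=param_grid, cv=ps)
# Fits on the rows marked -1, scores each candidate on the rows marked 0.
search.fit(X_data, y_data)
print(search.best_params_)
```

With this cv object there is exactly one "fold", so best_params_ reflects performance on that one specific dataset rather than an average over random splits.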