I built a simple stacking classifier with mlxtend and am experimenting with different base classifiers, and I've run into an interesting situation. From all my research, it seemed to me that stacking classifiers always perform at least as well as their base classifiers.

In my case, when I cross-validate the stacking classifier on the training set, I get a lower score than some of the base estimators. In fact, the stacking classifier's mean CV score often comes out equal to the lowest of the base estimators' mean CV scores.

Isn't that strange? Stranger still, once I run GridSearchCV on the stacking classifier, pick the best parameters, retrain on the whole training set, and finally compute accuracy on the test set, I actually get a decent score.

I know this approach is prone to leakage and that there are other techniques for cross-validating a stacking classifier, but they seem very slow, and from my research the approach above appears acceptable (regarding this potential leak, the Kaggle stacking guide even says "In practice everyone ignores this theoretical hole (and, frankly, I think most people are unaware it even exists!)" — see the parameter-tuning section of http://blog.kaggle.com/2016/12/27/a-kagglers-guide-to-model-stacking-in-practice/).
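For context, the slower leak-free alternative alluded to above is nested cross-validation: the grid search runs inside each outer CV fold, so no test fold ever influences parameter selection. A minimal sketch (using a plain `LogisticRegression` on synthetic data as a stand-in for the stacking classifier, so the names and data here are illustrative only):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, cross_val_score

X, y = make_classification(n_samples=200, random_state=0)

# Inner loop: grid search over C; outer loop: 5-fold CV around the whole search
inner = GridSearchCV(LogisticRegression(max_iter=1000),
                     {'C': np.logspace(-2, 3, num=6, base=10)}, cv=3)
outer_scores = cross_val_score(inner, X, y, cv=5)
print("Nested CV accuracy: %0.2f (+/- %0.2f)" % (outer_scores.mean(), outer_scores.std()))
```

This refits the grid search once per outer fold, which is exactly why it is so slow for a stacked model.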
import numpy as np
from mlxtend.classifier import StackingCVClassifier
from sklearn import model_selection, preprocessing
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import BernoulliNB
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import accuracy_score, classification_report
RANDOM_SEED = 12
# df is loaded in a separate code snippet
y = df['y']
X = df.drop(columns=['y'])
scaler = preprocessing.StandardScaler().fit(X)
X_transformed = scaler.transform(X)
X_train, X_test, y_train, y_test = train_test_split(X_transformed,y, random_state = 4)
def gridSearch_clf(clf, param_grid, X_train, y_train):
    gs = GridSearchCV(clf, param_grid).fit(X_train, y_train)
    print("Best Parameters")
    print(gs.best_params_)
    return gs.best_estimator_

def gs_report(y_test, X_test, best_estimator):
    print(classification_report(y_test, best_estimator.predict(X_test)))
    print("Overall Accuracy Score: ")
    print(accuracy_score(y_test, best_estimator.predict(X_test)))
lr = LogisticRegression()
np.random.seed(RANDOM_SEED)
# best_clf1/2/3 are the grid-searched DecisionTree, KNeighbors, and BernoulliNB
# estimators from an earlier snippet
sclf = StackingCVClassifier(classifiers=[best_clf1, best_clf2, best_clf3],
                            meta_classifier=lr)
clfs = [best_clf1, best_clf2, best_clf3, sclf]
clf_names = [i.__class__.__name__ for i in clfs]
print_cv(clfs, clf_names)
Accuracy: 0.68 (+/- 0.30) [Decision Tree Classifier]
Accuracy: 0.55 (+/- 0.26) [K Neighbors Classifier]
Accuracy: 0.67 (+/- 0.32) [Bernoulli Naive Bayes]
Accuracy: 0.55 (+/- 0.26) [StackingClassifier]
## StackingClassifier Accuracy = KNN Classifier Accuracy
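(`print_cv` isn't shown in my snippets; it's just a helper that cross-validates each classifier on the training split. A sketch consistent with the printed lines and the two-argument call above, assuming it reads `X_train`/`y_train` from the enclosing scope and that the `+/-` figure is the score standard deviation, would be:)

```python
from sklearn import model_selection
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier

# Stand-in training data so the sketch runs on its own
X_train, y_train = make_classification(n_samples=300, random_state=0)

def print_cv(clfs, clf_names):
    # Cross-validate each classifier on the training split and print
    # mean accuracy +/- standard deviation, matching the format above
    for clf, name in zip(clfs, clf_names):
        scores = model_selection.cross_val_score(clf, X_train, y_train,
                                                 cv=5, scoring='accuracy')
        print("Accuracy: %0.2f (+/- %0.2f) [%s]" % (scores.mean(), scores.std(), name))

print_cv([LogisticRegression(max_iter=1000), DecisionTreeClassifier(random_state=0)],
         ["Logistic Regression", "Decision Tree Classifier"])
```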
param_grid = {'meta-logisticregression__C':np.logspace(-2, 3, num=6, base=10)}
best_sclf = gridSearch_clf(sclf, param_grid, X_train, y_train)
gs_report(y_test,X_test, best_sclf)
Best Parameters
{'meta-logisticregression__C': 0.1}
             precision    recall  f1-score   support

          0       0.91      0.99      0.95      9131
          1       0.68      0.22      0.33      1166

avg / total       0.88      0.90      0.88     10297

Overall Accuracy Score: 
0.9000679809653297