
I'm using scikit-learn 0.13.1 for a contest on Kaggle. I'm using a Decision Tree classifier, and to evaluate my estimator I either split the training data with train_test_split or run cross validation with cross_val_score. Either technique shows that the estimator is about 90% accurate. However, when I use the estimator on the actual test data, the accuracy obtained is about 30% lower. Let's assume that the training data is a good representation of the test data.

What else can I do to evaluate the accuracy of the estimator?

from sklearn import tree
from sklearn.cross_validation import train_test_split, cross_val_score

clf = tree.DecisionTreeClassifier()
...
# Hold out 30% of the training data for evaluation.
X_train, X_test, y_train, y_test = train_test_split(train, target, test_size=0.3, random_state=42)
...
clf.fit(X_train, y_train)
print "Accuracy: %0.2f" % clf.score(X_test, y_test)
...
# 15-fold cross validation on the full training data.
scores = cross_val_score(clf, train, target, cv=15)
print "Accuracy: %0.2f (+/- %0.2f)" % (scores.mean(), scores.std() * 2)

1 Answer


This probably means that there is a significant discrepancy between the distribution of the final evaluation data and that of your development set.
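One way to sanity-check this (an illustrative sketch, not from the answer) is to compare simple per-feature statistics between the two datasets. This assumes the training and evaluation features are available as NumPy arrays; `test` here is a hypothetical name for the evaluation features:

import numpy as np

# Hypothetical: `test` holds the final evaluation features as a NumPy array.
# Large relative differences in per-feature means/stds suggest the
# distributions of the two datasets do not match.
for i in range(train.shape[1]):
    print "feature %d: train mean=%0.3f std=%0.3f | test mean=%0.3f std=%0.3f" % (
        i, train[:, i].mean(), train[:, i].std(),
        test[:, i].mean(), test[:, i].std())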

Still, it would be interesting to measure how much the decision tree is overfitting: what is the gap between the training score, clf.score(X_train, y_train), and the test score, clf.score(X_test, y_test)?
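A minimal sketch of that check, reusing `clf`, `X_train`, `X_test`, `y_train`, and `y_test` from the code in the question:

# Compare the score on the data the tree was fit on with the held-out score.
# A large gap (e.g. 1.00 vs 0.90) indicates the tree has overfit the training set.
train_score = clf.score(X_train, y_train)
test_score = clf.score(X_test, y_test)
print "Train accuracy: %0.2f" % train_score
print "Test accuracy:  %0.2f" % test_score
print "Gap (overfitting): %0.2f" % (train_score - test_score)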

Also, plain decision trees should be treated as toy classifiers: they have very poor generalization properties (and can overfit badly). You should really try ExtraTreesClassifier with an increasing number of n_estimators: start with n_estimators=10 if the dataset is small enough, then 50, 100, 500, 1000.
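For instance, a minimal sketch of that sweep, assuming `train` and `target` are the arrays from the question and using the sklearn-0.13-era import paths:

from sklearn.ensemble import ExtraTreesClassifier
from sklearn.cross_validation import cross_val_score  # sklearn.model_selection in newer versions

# Cross-validate ExtraTreesClassifier with an increasing number of trees;
# accuracy typically keeps improving (with diminishing returns) as n_estimators grows.
for n in [10, 50, 100, 500, 1000]:
    clf = ExtraTreesClassifier(n_estimators=n)
    scores = cross_val_score(clf, train, target, cv=5)
    print "n_estimators=%4d: accuracy %0.3f (+/- %0.3f)" % (n, scores.mean(), scores.std() * 2)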

answered 2013-06-07 18:58