machine-learning - How to deliberately overfit a Weka tree classifier?

Question

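I have a binary class data set (0 / 1) that is heavily skewed toward the "0" class (about 30000 vs. 1500). Each instance has 7 features, with no missing values.

When I use J48 or any other tree classifier, almost all of the "1" instances are misclassified as "0".

Setting the classifier to "unpruned", setting the minimum number of instances per leaf to 1, setting the confidence factor to 1, adding a dummy attribute with the instance ID number: none of this helped.

I just can't create a model that overfits my data!

I have also tried almost all of the other classifiers Weka provides, but got similar results.

Using IB1 gets 100% accuracy (trainset on trainset), so multiple instances with the same feature values and different classes are not the problem.

How can I create a completely unpruned tree? Or otherwise force Weka to overfit my data?

Thanks.

Update: Okay, this is absurd. I used only about 3100 negative and 1200 positive examples, and this is the tree I got (unpruned!):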

J48 unpruned tree
------------------

F <= 0.90747: 1 (201.0/54.0)
F > 0.90747: 0 (4153.0/1062.0)

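Needless to say, IB1 still gives 100% accuracy.

Update 2: I don't know how I missed it: unpruned SimpleCart works and gives 100% accuracy (trainset on trainset); pruned SimpleCart is not as biased as J48 and has decent false positive and false negative rates.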

score 5 · Accepted Answer

Weka contains two meta-classifiers of interest: weka.classifiers.meta.CostSensitiveClassifier and weka.classifiers.meta.MetaCost.

They allow you to make any algorithm cost-sensitive (not restricted to SVM) and to specify a cost matrix (the penalty for each kind of error); you would give a higher penalty for misclassifying 1 instances as 0 than for erroneously classifying 0 as 1.

The result is that the algorithm would then try to minimize the expected misclassification cost (rather than predict the most likely class).
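To make the effect of a cost matrix concrete, here is a minimal sketch in plain Python (not Weka code) of the minimum-expected-cost decision rule. It uses the class distribution of the large leaf in the J48 tree from the question (4153 instances, 1062 of them class 1). The function name `min_expected_cost_class` and the 5:1 cost ratio are assumptions chosen only for illustration.

```python
def min_expected_cost_class(probs, cost):
    """Return (best_class, expected_costs) under a cost matrix.

    probs[i]   -- estimated probability that the true class is i
    cost[i][j] -- penalty for predicting class j when the true class is i
    """
    n_classes = len(probs)
    # Expected cost of predicting j: sum over true classes i of P(i) * cost[i][j].
    expected = [
        sum(probs[i] * cost[i][j] for i in range(n_classes))
        for j in range(n_classes)
    ]
    best = min(range(n_classes), key=lambda j: expected[j])
    return best, expected


# Class distribution at the leaf "F > 0.90747: 0 (4153.0/1062.0)".
p1 = 1062 / 4153
probs = [1 - p1, p1]

# Equal costs reproduce the usual "most likely class" decision.
uniform = [[0, 1],
           [1, 0]]

# Hypothetical cost matrix: missing a "1" is 5 times worse than a false alarm.
cost = [[0, 1],   # true 0: predicting 1 costs 1
        [5, 0]]   # true 1: predicting 0 costs 5

print(min_expected_cost_class(probs, uniform)[0])  # 0
print(min_expected_cost_class(probs, cost)[0])     # 1
```

With equal costs the leaf predicts 0 because only about 26% of its instances are class 1. With the 5:1 penalty, predicting 0 has an expected cost of about 1.28 versus about 0.74 for predicting 1, so the decision flips.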

score 2 · Answer

The quick and dirty solution is to resample. Throw away all but 1500 of your negative ("0") examples and train on a balanced data set. I am pretty sure there is a resample component in Weka to do this.
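As a rough illustration of the idea, here is a minimal sketch of random undersampling in plain Python. It is not Weka's implementation; inside Weka, a filter such as weka.filters.supervised.instance.SpreadSubsample would be the natural tool. The helper name `undersample` and the synthetic rows are assumptions made for the example.

```python
import random


def undersample(rows, label_of, seed=0):
    """Randomly drop rows so every class keeps as many rows as the rarest class."""
    by_class = {}
    for row in rows:
        by_class.setdefault(label_of(row), []).append(row)
    target = min(len(group) for group in by_class.values())
    rng = random.Random(seed)  # fixed seed so the sample is reproducible
    balanced = []
    for group in by_class.values():
        balanced.extend(rng.sample(group, target))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous blocks
    return balanced


# Synthetic stand-in for the data in the question: 30000 "0" rows, 1500 "1" rows.
data = [(i, 0) for i in range(30000)] + [(i, 1) for i in range(30000, 31500)]
balanced = undersample(data, label_of=lambda row: row[1])

counts = {0: 0, 1: 0}
for _, label in balanced:
    counts[label] += 1
print(counts)  # {0: 1500, 1: 1500}
```

Keep in mind that a model trained on the balanced sample sees a 50/50 class prior, so it should still be evaluated on data with the original class distribution.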

The other solution is to use a classifier with a variable cost for each class. I'm pretty sure libSVM allows you to do this, and I know Weka can wrap libSVM. However, I haven't used Weka in a while, so I can't be of much practical help here.
