See the section titled "Balancing prediction error" from the official documentation on random forests here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
I marked some parts in bold.
In summary, this seems to suggest that you should either
- have your training and test data reflect the 1:4 ratio of classes that your real-life data will have,
or
- use a 1:1 mix, but then carefully adjust the weights per class, as
demonstrated below, until the OOB error rate on your desired (smaller)
class is lowered.
Hope that helps.

Quoted from the linked documentation:
In some data sets, the prediction error between classes is highly
unbalanced. Some classes have a low prediction error, others a high.
This occurs usually when one class is much larger than another. Then
random forests, trying to minimize overall error rate, will keep the
error rate low on the large class while letting the smaller classes
have a larger error rate. For instance, in drug discovery, where a
given molecule is classified as active or not, it is common to have
the actives outnumbered by 10 to 1, up to 100 to 1. In these
situations the error rate on the interesting class (actives) will be
very high.
The user can detect the imbalance by outputting the error rates for the
individual classes. To illustrate, 20-dimensional synthetic data is
used. Class 1 occurs in one spherical Gaussian, class 2 in another. A
training set of 1000 class 1's and 50 class 2's is generated, together
with a test set of 5000 class 1's and 250 class 2's.
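A sketch of this experiment, assuming Python with scikit-learn (the original used Breiman's own implementation, and the Gaussian means here are made up, since the documentation does not state how far apart the classes are):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
DIM = 20

def make_data(n1, n2):
    """Two spherical Gaussians in 20 dimensions.

    The means (0 and 0.5 per coordinate) are assumptions, chosen only
    so the classes overlap somewhat.
    """
    X = np.vstack([rng.normal(0.0, 1.0, (n1, DIM)),
                   rng.normal(0.5, 1.0, (n2, DIM))])
    y = np.array([1] * n1 + [2] * n2)
    return X, y

X_train, y_train = make_data(1000, 50)   # 20:1 class imbalance
X_test, y_test = make_data(5000, 250)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

# Per-class test error: the overall error rate hides the minority class.
pred = forest.predict(X_test)
for cls in (1, 2):
    mask = y_test == cls
    print(f"class {cls} error rate: {np.mean(pred[mask] != cls):.3f}")
```

Printing the error rate per class, rather than overall, is what exposes the imbalance.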
The final output of a forest of 500 trees on this data (columns: trees
grown, overall test error %, class 1 error %, class 2 error %) is:
500 3.7 0.0 78.4
There is a low overall test set error (3.7%), but class 2 has over 3/4
of its cases misclassified.
The error balancing can be done by setting different weights for
the classes.
The higher the weight a class is given, the more its error rate is
decreased. A guide to choosing weights is to make them inversely
proportional to the class populations. So set the weight to 1
on class 1, and 20 on class 2, and run again. The output is:
500 12.1 12.7 0.0
The weight of 20 on class 2 is too high. Set it to 10 and try again,
getting:
500 4.3 4.2 5.2
This is pretty close to balance. If exact balance is wanted, the
weight on class 2 could be jiggled around a bit more.
Note that in achieving this balance, the overall error rate went up.
This is the usual result: to get better balance, the overall error
rate will be increased.
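In scikit-learn, for example, this reweighting corresponds to the `class_weight` parameter of `RandomForestClassifier`. A sketch on toy data like the above (the Gaussian means are assumptions, and the starting weight follows the inverse-proportion guideline rather than being tuned):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Imbalanced toy data: 1000 class 1's vs 50 class 2's, drawn from two
# 20-dimensional spherical Gaussians (the means are assumptions).
X = np.vstack([rng.normal(0.0, 1.0, (1000, 20)),
               rng.normal(0.5, 1.0, (50, 20))])
y = np.array([1] * 1000 + [2] * 50)

# Start with weights inversely proportional to class populations
# (1000/50 = 20), then adjust the class-2 weight down if its error
# overshoots, as described above.
forest = RandomForestClassifier(
    n_estimators=500,
    class_weight={1: 1, 2: 20},
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)

# Per-class OOB error: the quantity to watch while tuning the weight.
oob_pred = forest.classes_[np.argmax(forest.oob_decision_function_, axis=1)]
for cls in (1, 2):
    mask = y == cls
    print(f"class {cls} OOB error: {np.mean(oob_pred[mask] != cls):.3f}")
```

Re-running this loop with a few candidate weights for class 2, and keeping the one whose per-class OOB errors come out roughly equal, mirrors the weight-jiggling procedure the documentation describes.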