See the section titled "Balancing prediction error" from the official documentation on random forests here: https://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm#balance
I marked some parts in bold.
In summary, this seems to suggest that you should either
- have your training and test data reflect the 1:4 ratio of classes that your real-life data will have,
or
- use a 1:1 mix, but then carefully adjust the weights per class, as
demonstrated below, until the OOB error rate on your desired (smaller)
class is lowered.
Hope that helps.

Quoted from the linked documentation:
In some data sets, the prediction error between classes is highly
unbalanced. Some classes have a low prediction error, others a high.
This occurs usually when one class is much larger than another. Then
random forests, trying to minimize overall error rate, will keep the
error rate low on the large class while letting the smaller classes
have a larger error rate. For instance, in drug discovery, where a
given molecule is classified as active or not, it is common to have
the actives outnumbered by 10 to 1, up to 100 to 1. In these
situations the error rate on the interesting class (actives) will be
very high.
The user can detect the imbalance by outputting the error rates for the
individual classes. To illustrate, 20-dimensional synthetic data is
used. Class 1 occurs in one spherical Gaussian, class 2 in another. A
training set of 1000 class 1's and 50 class 2's is generated, together
with a test set of 5000 class 1's and 250 class 2's.
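A sketch of this experiment, assuming Python with scikit-learn (the original used Breiman's own implementation, and the Gaussian means here are made up, since the documentation does not state how far apart the classes are):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
DIM = 20

def make_data(n1, n2):
    """Two spherical Gaussians in 20 dimensions.

    The means (0 and 0.5 per coordinate) are assumptions, chosen only
    so the classes overlap somewhat.
    """
    X = np.vstack([rng.normal(0.0, 1.0, (n1, DIM)),
                   rng.normal(0.5, 1.0, (n2, DIM))])
    y = np.array([1] * n1 + [2] * n2)
    return X, y

X_train, y_train = make_data(1000, 50)   # 20:1 class imbalance
X_test, y_test = make_data(5000, 250)

forest = RandomForestClassifier(n_estimators=500, random_state=0)
forest.fit(X_train, y_train)

# Per-class test error: the overall error rate hides the minority class.
pred = forest.predict(X_test)
for cls in (1, 2):
    mask = y_test == cls
    print(f"class {cls} error rate: {np.mean(pred[mask] != cls):.3f}")
```

Printing the error rate per class, rather than overall, is what exposes the imbalance.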
The final output of a forest of 500 trees on this data (columns: trees
grown, overall test error %, class 1 error %, class 2 error %) is:
500 3.7 0.0 78.4
There is a low overall test set error (3.7%), but class 2 has over 3/4
of its cases misclassified.
The error balancing can be done by setting different weights for
the classes.
The higher the weight a class is given, the more its error rate is
decreased. A guide to choosing weights is to make them inversely
proportional to the class populations. So set the weight to 1
on class 1, and 20 on class 2, and run again. The output is:
500 12.1 12.7 0.0
The weight of 20 on class 2 is too high. Set it to 10 and try again,
getting:
500 4.3 4.2 5.2
This is pretty close to balance. If exact balance is wanted, the
weight on class 2 could be jiggled around a bit more.
Note that in achieving this balance, the overall error rate went up.
This is the usual result: to get better balance, the overall error
rate will be increased.
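In scikit-learn, for example, this reweighting corresponds to the `class_weight` parameter of `RandomForestClassifier`. A sketch on toy data like the above (the Gaussian means are assumptions, and the starting weight follows the inverse-proportion guideline rather than being tuned):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Imbalanced toy data: 1000 class 1's vs 50 class 2's, drawn from two
# 20-dimensional spherical Gaussians (the means are assumptions).
X = np.vstack([rng.normal(0.0, 1.0, (1000, 20)),
               rng.normal(0.5, 1.0, (50, 20))])
y = np.array([1] * 1000 + [2] * 50)

# Start with weights inversely proportional to class populations
# (1000/50 = 20), then adjust the class-2 weight down if its error
# overshoots, as described above.
forest = RandomForestClassifier(
    n_estimators=500,
    class_weight={1: 1, 2: 20},
    oob_score=True,
    random_state=0,
)
forest.fit(X, y)

# Per-class OOB error: the quantity to watch while tuning the weight.
oob_pred = forest.classes_[np.argmax(forest.oob_decision_function_, axis=1)]
for cls in (1, 2):
    mask = y == cls
    print(f"class {cls} OOB error: {np.mean(oob_pred[mask] != cls):.3f}")
```

Re-running this loop with a few candidate weights for class 2, and keeping the one whose per-class OOB errors come out roughly equal, mirrors the weight-jiggling procedure the documentation describes.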