您当然可以调整控制随机选择的特征数量的超参数,以从引导数据中生成每棵树。通常,您通过 k 折交叉验证来执行此操作;选择最小化测试样本预测误差的调整参数。此外,种植更大的森林将提高预测的准确性,尽管一旦种植了数百棵树,回报通常会递减。
试试这个示例代码。
from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor(random_state = 42)
from pprint import pprint # Look at parameters used by our current forest
print(rf.get_params())
结果:
{'bootstrap': True, 'ccp_alpha': 0.0, 'criterion': 'mse', 'max_depth': None, 'max_features': 'auto', 'max_leaf_nodes': None, 'max_samples': None, 'min_impurity_decrease': 0.0, 'min_impurity_split': None, 'min_samples_leaf': 1, 'min_samples_split': 2, 'min_weight_fraction_leaf': 0.0, 'n_estimators': 100, 'n_jobs': None, 'oob_score': False, 'random_state': None, 'verbose': 0, 'warm_start': False}
还...
import numpy as np
from sklearn.model_selection import RandomizedSearchCV # Number of trees in random forest
n_estimators = [int(x) for x in np.linspace(start = 200, stop = 2000, num = 10)]
# Number of features to consider at every split
max_features = ['auto', 'sqrt']
# Maximum number of levels in tree
max_depth = [int(x) for x in np.linspace(10, 110, num = 11)]
max_depth.append(None)
# Minimum number of samples required to split a node
min_samples_split = [2, 5, 10]
# Minimum number of samples required at each leaf node
min_samples_leaf = [1, 2, 4]
# Method of selecting samples for training each tree
bootstrap = [True, False]# Create the random grid
random_grid = {'n_estimators': n_estimators,
'max_features': max_features,
'max_depth': max_depth,
'min_samples_split': min_samples_split,
'min_samples_leaf': min_samples_leaf,
'bootstrap': bootstrap}
pprint(random_grid)
结果:
{'bootstrap': [True, False],
'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, None],
'max_features': ['auto', 'sqrt'],
'min_samples_leaf': [1, 2, 4],
'min_samples_split': [2, 5, 10],
'n_estimators': [200, 400, 600, 800, 1000, 1200, 1400, 1600, 1800, 2000]}
有关更多信息,请参阅此链接。
https://towardsdatascience.com/optimizing-hyperparameters-in-random-forest-classification-ec7741f9d3f6
这是一些进行交叉验证的示例代码。
# import random search, random forest, iris data, and distributions
from sklearn.model_selection import cross_validate
from sklearn import datasets
from sklearn.ensemble import RandomForestClassifier
# get iris data
iris = datasets.load_iris()
X = iris.data
y = iris.target
model = RandomForestClassifier(random_state=1)
cv = cross_validate(model, X, y, cv=5)
print(cv)
print(cv['test_score'])
print(cv['test_score'].mean())
结果:
{'fit_time': array([0.18350697, 0.14461398, 0.14261866, 0.13116884, 0.15478826]), 'score_time': array([0.01496148, 0.00997281, 0.00897574, 0.00797844, 0.01396227]), 'test_score': array([0.96666667, 0.96666667, 0.93333333, 0.96666667, 1. ])}
[0.96666667 0.96666667 0.93333333 0.96666667 1. ]
0.9666666666666668
交叉验证的内部工作:
Shuffle the dataset in order to remove any kind of order
Split the data into K number of folds. K= 5 or 10 will work for most of the cases
Now keep one fold for testing and remaining all the folds for training
Train(fit) the model on train set and test(evaluate) it on test set and note down the results for that split
Now repeat this process for all the folds, every time choosing separate fold as test data
So for every iteration our model gets trained and tested on different sets of data
At the end sum up the scores from each split and get the mean score