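I am trying to use RandomizedSearchCV to find the best parameters for a random forest that should predict a continuous variable.
I have been working on the following approach, in particular changing the scoring function, and eventually settled on the regression metric median_absolute_error. However, I believe KFold cross-validation is not suitable for my data, and I don't understand how to use an iterable cv (see, for example, https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.RandomizedSearchCV.html), because (as far as I understand) I cannot fit my model and predict with it before running RandomizedSearchCV.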
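from random import randint  # assumed source of randint; the original import is not shown
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import make_scorer, median_absolute_error
from sklearn.model_selection import RandomizedSearchCV

def my_custom_score(y_true, y_pred, dates_, features, labels):
    return median_absolute_error(y_true, y_pred)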
...
for i in range(0, 3):  # predict three 10-point intervals
    prediction_colour = ['g', 'r', 'c', 'm', 'y', 'k', 'w'][i % 7]
    date_for_test = randint(11, 200)  # end of the trend
    # one predicted interval should have 10 date points
    dates_for_test = range(date_for_test - 10, date_for_test)
    for idx, date_for_test_ in enumerate(sorted(dates_for_test, reverse=True)):
        train_features = features[sorted(dates_for_test, reverse=True)[0] - 2:]
        train_labels = labels[sorted(dates_for_test, reverse=True)[0] - 2:]
        test_features = np.atleast_2d(features[date_for_test_])
        test_labels = labels[date_for_test_] if date_for_test != 0 else 1.0
        # criterion='mse' is spelled 'squared_error' from scikit-learn 1.0 onward
        rf = RandomForestRegressor(bootstrap=False, criterion='mse', max_features=5,
                                   min_weight_fraction_leaf=0, n_jobs=1, oob_score=False,
                                   random_state=None, verbose=0, warm_start=False)
        parameters = {
            "max_leaf_nodes": [2] + list(range(5, 51, 5)),
            "min_samples_leaf": [1] + list(range(50, 1001, 50)),
            "min_samples_split": [2] + list(range(50, 1001, 50)),
            "n_estimators": [10, 100] + list(range(250, 10001, 250)),
            "max_depth": [1] + list(range(50, 1001, 50)),
        }
        # median_absolute_error is a loss, so greater_is_better=False makes the search minimise it
        grid_search = RandomizedSearchCV(
            cv=5, estimator=rf, param_distributions=parameters, n_iter=10,
            scoring=make_scorer(median_absolute_error, greater_is_better=False))
        # scoring=make_scorer(lambda x, y: my_custom_score(x, y, sorted(dates_for_test, reverse=True), features, labels), greater_is_better=False)
        grid_search.fit(train_features, train_labels)
        rf = grid_search.best_estimator_
        best_parameters = rf.get_params()
        print("best parameters")
        for param_name in sorted(parameters.keys()):
            print("\t%s: %r" % (param_name, best_parameters[param_name]))
        predictions = rf.predict(test_features)
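To make the iterable cv question concrete, here is a minimal sketch of what such an argument can look like. It assumes the rows are ordered in time and uses synthetic stand-in data; the names X, y, tscv and manual_splits are hypothetical and not part of the code above. The cv parameter accepts either a splitter object such as TimeSeriesSplit or any iterable of (train_indices, test_indices) pairs, so nothing has to be fitted before the search is run.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, TimeSeriesSplit

# Synthetic stand-in data (hypothetical): 120 time-ordered rows, 5 features.
rng = np.random.default_rng(0)
X = rng.normal(size=(120, 5))
y = X[:, 0] + 0.1 * rng.normal(size=120)

# Option 1: a splitter object that yields expanding-window splits.
tscv = TimeSeriesSplit(n_splits=4)

# Option 2: a hand-built list of (train_indices, test_indices) pairs.
# Each fold trains on all rows before a 10-point test window.
manual_splits = [
    (np.arange(0, start), np.arange(start, start + 10))
    for start in (80, 90, 100, 110)
]

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions={
        "n_estimators": [10, 25, 50],
        "max_depth": [2, 4, 8],
        "min_samples_leaf": [1, 5, 10],
    },
    n_iter=3,
    cv=manual_splits,  # cv=tscv would work the same way
    scoring="neg_median_absolute_error",  # built-in scorer, already negated
    random_state=0,
)
search.fit(X, y)
print(search.best_params_)
```

Both options only ever test on rows that come after the training rows, which is the property plain KFold does not guarantee. Whether expanding windows or fixed 10-point windows fit the data better is a judgment call that depends on the series.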
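In addition, with the current approach I get the same value for the out-of-sample predictions on several future dates (shown in different colours on the plot).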
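One possible explanation for identical predictions, offered as an assumption rather than a diagnosis of this data: when min_samples_leaf is larger than half the number of training rows, no split can satisfy it, so every tree is a single leaf that predicts the mean of the training labels. The search space above contains values up to 1000 for min_samples_leaf and min_samples_split, so this can happen if the training slice is small. The sketch below reproduces the effect on hypothetical synthetic data; X_train, y_train, X_new, flat and free are illustrative names.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Hypothetical data: 100 training rows with a clear linear signal.
rng = np.random.default_rng(0)
X_train = rng.uniform(0, 10, size=(100, 1))
y_train = 3.0 * X_train[:, 0]

# Three new points spread across the feature range.
X_new = np.array([[1.0], [5.0], [9.0]])

# min_samples_leaf=60 exceeds half of the 100 rows, so no split is allowed
# and each tree collapses to one leaf holding the mean of y_train.
flat = RandomForestRegressor(n_estimators=10, bootstrap=False,
                             min_samples_leaf=60, random_state=0)
flat.fit(X_train, y_train)
flat_pred = flat.predict(X_new)

# With min_samples_leaf=1 the trees can split and follow the signal.
free = RandomForestRegressor(n_estimators=10, bootstrap=False,
                             min_samples_leaf=1, random_state=0)
free.fit(X_train, y_train)
free_pred = free.predict(X_new)

print("constrained:", flat_pred)
print("unconstrained:", free_pred)
```

If this is what is happening, printing the number of training rows next to the chosen best parameters should show it directly.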
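The documentation on this topic is very thorough, but I find it too detailed and I am simply lost. Could someone point me in the right direction?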