
I tried fitting a gradient boosted model (weak learners are max.depth = 2 trees) to the iris data set using the gbm package. I set the number of iterations to M = 1000 and the learning rate to learning.rate = 0.001. I then compared the results to those of a regression tree (using rpart). However, the regression tree seems to outperform the gradient boosted model. What is the reason behind this, and how can I improve the gradient boosted model's performance? I thought a learning rate of 0.001 should be sufficient with 1000 iterations/boosted trees.

library(rpart)
library(gbm)
data(iris)

# Train on the first 100 rows, test on the last 50
train.dat <- iris[1:100, ]
test.dat <- iris[101:150, ]

learning.rate <- 0.001
M <- 1000
# Gradient boosted model: 1000 depth-2 trees, no subsampling
gbm.model <- gbm(Sepal.Length ~ ., data = train.dat, distribution = "gaussian", n.trees = M, 
    interaction.depth = 2, shrinkage = learning.rate, bag.fraction = 1, train.fraction = 1)
yhats.gbm <- predict(gbm.model, newdata = test.dat, n.trees = M)

# Single regression tree for comparison
tree.mod <- rpart(Sepal.Length ~ ., data = train.dat)
yhats.tree <- predict(tree.mod, newdata = test.dat)

> sqrt(mean((test.dat$Sepal.Length - yhats.gbm)^2))
[1] 1.209446
> sqrt(mean((test.dat$Sepal.Length - yhats.tree)^2))
[1] 0.6345438

1 Answer


In the iris dataset there are 3 different species: the first 50 rows are setosa, the next 50 are versicolor, and the last 50 are virginica. With your split, the model is trained only on setosa and versicolor and then tested entirely on virginica, a species it has never seen. So I think it's better if you shuffle the rows, and also make the Species column relevant:

library(ggplot2)
ggplot(iris,aes(x=Sepal.Width,y=Sepal.Length,col=Species)) + geom_point()

(Scatter plot of Sepal.Length against Sepal.Width, colored by Species.)
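To address the first point, here is a minimal sketch of a randomized split; the seed and the 100/50 proportions are my own illustrative choices, not from the original post:

# Shuffle the rows so both the training and test sets contain all three species
set.seed(123)  # arbitrary seed, chosen only for reproducibility
shuffled <- iris[sample(nrow(iris)), ]
train.dat <- shuffled[1:100, ]
test.dat  <- shuffled[101:150, ]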

Secondly, you should do this over a few replicates to see the uncertainty. For this we can use caret: we can define the training samples beforehand and also provide a fixed grid. What we are interested in is the cross-validated error during training, which is similar to what you are doing:

library(caret)

set.seed(999)
# Three folds of 50 rows each, used as the training indices
idx = split(sample(nrow(iris)), 1:nrow(iris) %% 3)
tr = trainControl(method = "cv", index = idx)
this_grid = data.frame(interaction.depth = 2, shrinkage = 0.001,
                       n.minobsinnode = 10, n.trees = 1000)

gbm_fit = train(Sepal.Width ~ ., data = iris, method = "gbm",
                distribution = "gaussian", tuneGrid = this_grid, trControl = tr)

Then we use the same samples to fit rpart:

# cp = 0.01 is the default for rpart
this_grid = data.frame(cp = 0.01)
rpart_fit = train(Sepal.Width ~ ., data = iris, method = "rpart",
                  trControl = tr, tuneGrid = this_grid)

Finally we compare them, and they are very similar:

gbm_fit$resample
       RMSE  Rsquared       MAE Resample
1 0.3459311 0.5000575 0.2585884        0
2 0.3421506 0.4536114 0.2631338        1
3 0.3428588 0.5600722 0.2693837        2

rpart_fit$resample
       RMSE  Rsquared       MAE Resample
1 0.3492542 0.3791232 0.2695451        0
2 0.3320841 0.4276960 0.2550386        1
3 0.3284239 0.4343378 0.2570833        2
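
For a side-by-side summary of the two fits, caret's resamples() helper can be used (a short sketch, assuming both train() calls above completed):

# Pool the per-fold metrics of both models and summarize them together
resamps <- resamples(list(gbm = gbm_fit, rpart = rpart_fit))
summary(resamps)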

So I suspect there's something weird in the example above. Again, it always depends on your data: for some data sets, iris for example, rpart might be good enough because there are very strong predictors. Also, for complex models like gbm, you most likely need to tune with something like the above to find the optimal parameters.
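As a sketch of what such a tuning run might look like, one could pass a wider grid to the same trControl; the grid values below are illustrative choices, not recommendations from this answer:

# Search over several shrinkage values, tree depths, and ensemble sizes
wide_grid <- expand.grid(interaction.depth = c(1, 2, 3),
                         shrinkage = c(0.001, 0.01, 0.1),
                         n.minobsinnode = 10,
                         n.trees = c(100, 500, 1000))
gbm_tuned <- train(Sepal.Width ~ ., data = iris, method = "gbm",
                   distribution = "gaussian", tuneGrid = wide_grid,
                   trControl = tr, verbose = FALSE)
gbm_tuned$bestTune  # parameter combination with the lowest cross-validated RMSE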

answered 2020-04-02 23:10