
(Previously posted here, to the wrong sub, with not enough info; that post was closed, my edits seem to have been deleted, and the post was consigned to purgatory. Apologies for re-posting; I don't know whether the previous post can or should be resurrected.)

In R, I've run some Boosted Regression Trees, aka Generalized Boosting Models, using dismo which uses gbm. Reproducible example to get people to where I am currently:

library(dismo); data(Anguilla_train)
angaus.tc5.lr01 <- gbm.step(data = Anguilla_train, gbm.x = 3:13, gbm.y = 2,
                            family = "bernoulli", tree.complexity = 5,
                            learning.rate = 0.01, bag.fraction = 0.5)

(From here). This leaves you with the gbm model object "angaus.tc5.lr01". I'd like to generate dendrograms of the splits, i.e. plot the trees, as per De'ath 2007 (see pic, left pane). BUT: De'ath's plot is of a single regression tree, not a boosted regression tree, which is the average of potentially thousands of trees, each grown on a different random sample of the dataset.

User ckluss kindly suggested rpart, however that needs the model to be generated by rpart so doesn't work for BRTs/GBMs produced by gbm.step. The same is true of prp from rpart.plot.

pretty.gbm.tree in gbm extracts a matrix of info for any one selected tree (try pretty.gbm.tree(angaus.tc5.lr01, i.tree = 1) for the first), so I'm wondering if this might be a plausible route to success? E.g. by writing some script which creates an averaged tree matrix from all of the available trees, then converting this into a tree-like object, possibly using some of the methods here.

People have asked similar questions elsewhere on the net, seemingly without success. BRT models are regularly described as 'black boxes', so maybe the general opinion is that one shouldn't need to, be able to, or bother to probe into them and display their inner workings.

If anyone knows enough about BRTs / gbm and has any ideas, they'd be gratefully received. Thanks.

De'ath tree diagram


1 Answer


As you point out, interpreting an ensemble of decision trees is much harder than interpreting a single tree. Geometrically, you can think of a decision-tree ensemble as an approximation to a complex, high-dimensional surface. The goal is to find the variables that contribute to that approximation and to visualize their effects.

The basic idea in interpreting an ensemble is not to obtain an "average" tree, or plots of any single tree, but to visualize the "average" effect of a variable. In the literature this is the "partial dependence" of a predictor: its effect holding the other variables constant. How partial dependence is estimated is a bit involved to describe, but it is the model-implied prediction obtained by letting only predictor j vary for observation i, then averaging those predictions over all i observations. For the gory details, see Friedman & Popescu (2008).
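The averaging step described above can be sketched in a few lines of base R. This is an illustrative implementation for a generic model with a predict() method, not dismo's internal code; the function name partial_dependence and the toy lm() example are made up for demonstration:

```r
# Sketch of single-predictor partial dependence, assuming `model` has a
# predict() method and `data` is the training data frame.
partial_dependence <- function(model, data, var, grid = NULL, ...) {
  if (is.null(grid)) grid <- sort(unique(data[[var]]))
  sapply(grid, function(v) {
    d <- data
    d[[var]] <- v                 # fix predictor j at one value for all rows
    mean(predict(model, d, ...))  # average predictions over observations i
  })
}

# Toy check with a linear model: the partial dependence of x1 should
# trace out its (roughly linear) fitted effect.
set.seed(1)
df  <- data.frame(x1 = rnorm(100), x2 = rnorm(100))
df$y <- 2 * df$x1 + df$x2 + rnorm(100, sd = 0.1)
fit <- lm(y ~ x1 + x2, data = df)
pd  <- partial_dependence(fit, df, "x1", grid = c(-1, 0, 1))
```

For a gbm object you would additionally pass n.trees to predict(); gbm also ships its own plot.gbm method that computes this more efficiently.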

You can then plot the estimated partial dependence (or, as I'd put it, the "model-implied" effect) of a predictor against its actual values. This lets you see the model-implied effect of each predictor.

The good news is that such plots are easy to obtain with dismo. See gbm.plot for single predictors, and gbm.perspec for perspective plots involving two predictors. The vignette also provides examples. To help interpret the model further, gbm.interactions offers a way to detect possible 2- or 3-way interactions. See this question for details.
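For the model fitted in the question, the calls would look roughly like this (a sketch following the dismo BRT vignette; the argument values, e.g. the predictor indices in gbm.perspec, are illustrative):

```r
library(dismo)

# Partial-dependence plots for the predictors in the fitted BRT
gbm.plot(angaus.tc5.lr01, n.plots = 11, write.title = FALSE)

# Perspective (3-D) plot of the joint effect of two predictors,
# identified by their positions among the gbm.x columns
gbm.perspec(angaus.tc5.lr01, x = 7, y = 1)

# Screen for possible pairwise interactions among predictors
find.int <- gbm.interactions(angaus.tc5.lr01)
find.int$interactions
find.int$rank.list
```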

Answered 2015-09-24T14:00:14.787