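A quick question about tree models in R. I want to build a tree model on a large number of variables (mostly numeric or factor variables). One of the variables is Gender, with the categories male, female, and unknown. When I use the tree or rpart functions from their respective libraries, I get only two branches from the Gender root: the unknown gender has been grouped with female, so the branches are Female+Unknown and Male. I checked the tree package PDF, http://cran.r-project.org/web/packages/tree/tree.pdf, and it says that the levels of an unordered factor are divided into two non-empty groups. The rpart function seems to behave very much like the tree function when handling factors with more than 2 levels.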

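So my question is: is there any other function or package in R that would let me generate 3 or more branches from a single node, or does anyone have suggestions for other open-source tools that can do the same? If you need more information, please let me know.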


1 Answer


rpart() is perfectly capable of handling a response with more than 2 classes. Try:

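library(rpart)                          # load the rpart package
mod <- rpart(Species ~ ., data = iris)  # Species is a factor with three classes
mod                                     # print the fitted tree
plot(mod)                               # draw the tree
text(mod)                               # label the splits and terminal nodes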

When run with the default settings, this produces a tree with 3 terminal nodes:

R> mod
n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

    1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)  
      2) Petal.Length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
      3) Petal.Length>=2.45 100  50 versicolor (0.00000000 0.50000000 0.50000000)  
        6) Petal.Width< 1.75 54   5 versicolor (0.00000000 0.90740741 0.09259259) *
        7) Petal.Width>=1.75 46   1 virginica (0.00000000 0.02173913 0.97826087) *

The recursive partitioning algorithm stops building a tree when certain stopping rules are met. There is no point splitting a node that is already pure (of a single class). By default, a node must contain at least 20 observations before a split is attempted (minsplit), no split is made that would leave a terminal node with fewer than 7 observations (minbucket), and no split is made unless it improves the lack of fit by a factor of 0.01 (the complexity parameter cp), and so on. Some of these rules can be controlled through the rpart.control() function.
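As a quick check of the stopping rules described above, the defaults can be read straight from the list that rpart.control() returns when called with no arguments. This is only a small sketch; the variable name `defaults` is mine.

```r
library(rpart)

# rpart.control() with no arguments returns the default settings as a list
defaults <- rpart.control()

defaults$minsplit   # 20: observations a node needs before a split is attempted
defaults$minbucket  # 7: minimum observations in any terminal node (round(minsplit / 3))
defaults$cp         # 0.01: minimum relative improvement in fit required for a split
```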

From what limited information you have given us, I can only conclude that these defaults are inappropriate for your data set and you should adjust them accordingly, e.g.:

# allow splits of nodes with as few as 2 observations, leaves of size 1, and a tiny cp
ctrl <- rpart.control(minsplit = 2, minbucket = 1, cp = 0.00001)
mod2 <- rpart(Species ~ ., data = iris, control = ctrl)
mod2
plot(mod2)
text(mod2)

Which, for this example data set, produces a much larger tree:

R> mod2
n= 150 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 150 100 setosa (0.33333333 0.33333333 0.33333333)  
   2) Petal.Length< 2.45 50   0 setosa (1.00000000 0.00000000 0.00000000) *
   3) Petal.Length>=2.45 100  50 versicolor (0.00000000 0.50000000 0.50000000)  
     6) Petal.Width< 1.75 54   5 versicolor (0.00000000 0.90740741 0.09259259)  
      12) Petal.Length< 4.95 48   1 versicolor (0.00000000 0.97916667 0.02083333)  
        24) Petal.Width< 1.65 47   0 versicolor (0.00000000 1.00000000 0.00000000) *
        25) Petal.Width>=1.65 1   0 virginica (0.00000000 0.00000000 1.00000000) *
      13) Petal.Length>=4.95 6   2 virginica (0.00000000 0.33333333 0.66666667)  
        26) Petal.Width>=1.55 3   1 versicolor (0.00000000 0.66666667 0.33333333)  
          52) Sepal.Length< 6.95 2   0 versicolor (0.00000000 1.00000000 0.00000000) *
          53) Sepal.Length>=6.95 1   0 virginica (0.00000000 0.00000000 1.00000000) *
        27) Petal.Width< 1.55 3   0 virginica (0.00000000 0.00000000 1.00000000) *
     7) Petal.Width>=1.75 46   1 virginica (0.00000000 0.02173913 0.97826087)  
      14) Petal.Length< 4.85 3   1 virginica (0.00000000 0.33333333 0.66666667)  
        28) Sepal.Length< 5.95 1   0 versicolor (0.00000000 1.00000000 0.00000000) *
        29) Sepal.Length>=5.95 2   0 virginica (0.00000000 0.00000000 1.00000000) *
      15) Petal.Length>=4.85 43   0 virginica (0.00000000 0.00000000 1.00000000) *

but is most likely highly over-fitted to the data.
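One common way to deal with an over-grown tree like this is to prune it back using the cross-validated error that rpart() stores in its complexity table. The following is a rough sketch, not a recommendation for your data: picking the cp with the smallest xerror is just one possible rule, and the names `best_cp`, `pruned`, and `n_leaves` are mine.

```r
library(rpart)

set.seed(1)  # the cross-validation inside rpart() is random; fix the seed for repeatability

# Refit the deliberately over-grown tree from the example above
ctrl <- rpart.control(minsplit = 2, minbucket = 1, cp = 0.00001)
mod2 <- rpart(Species ~ ., data = iris, control = ctrl)

printcp(mod2)  # complexity table: CP, nsplit, rel error, xerror, xstd

# Pick the cp value whose cross-validated error (xerror) is smallest
cptab   <- mod2$cptable
best_cp <- cptab[which.min(cptab[, "xerror"]), "CP"]

# Cut the tree back to that complexity
pruned <- prune(mod2, cp = best_cp)

# Count terminal nodes before and after pruning
n_leaves <- function(fit) sum(fit$frame$var == "<leaf>")
c(full = n_leaves(mod2), pruned = n_leaves(pruned))
```

Because the cross-validation folds are random, the chosen cp (and so the size of the pruned tree) can change with the seed.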

That said, there are, of course, other packages that can fit trees and that, like rpart(), can handle a response with more than two levels. The main ones are listed in the Machine Learning & Statistical Learning Task View on CRAN, which you should consult. One such package is party.
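For instance, a minimal sketch with party, assuming the package is installed from CRAN (it does not ship with base R), fits a conditional inference tree to the same three-class response using its ctree() function:

```r
library(party)  # assumes party has been installed, e.g. install.packages("party")

# Fit a conditional inference tree to the three-class Species response
ct <- ctree(Species ~ ., data = iris)

ct        # print the fitted tree
plot(ct)  # plot it

# Cross-tabulate fitted classes against the observed classes
pred <- predict(ct)
table(predicted = pred, observed = iris$Species)
```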

answered 2012-09-21T08:40:55.833