
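I have saved models that were created with the rpart package in R. I am trying to retrieve some information from these saved models, specifically from the rpart.object. While the documentation (the rpart doc) is helpful, a few things remain unclear: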

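  1. How do I find out which variables are categorical and which are numeric? Currently, I refer to the "index" column of the splits matrix. I have noticed that the entries are non-integer only for numeric variables. Is there a cleaner way to do this?
  2. The csplit matrix refers to the values a categorical variable can take using integers, i.e. R maps the original names to integers. Is there a way to access this mapping? For example, if my original variable Country can take any of the values France, Germany, Japan, etc., the csplit matrix tells me that a certain split is based on Country == 1, 2. Here, rpart has replaced the references to France, Germany with 1, 2 respectively. How do I get the original names (France, Germany, Japan) back from the model file? Also, how do I know what the mapping between the names and the integers is?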

1 Answer


Generally, it is the terms component that holds this kind of information. See ?rpart::rpart.object.

library(rpart)  # attaching the package also makes the kyphosis data set visible
fit <- rpart::rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)
fit$terms  # notice that the attribute dataClasses has the information
attr(fit$terms, "dataClasses")
#------------
 Kyphosis       Age    Number     Start 
 "factor" "numeric" "numeric" "numeric" 

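As a minimal sketch of how the dataClasses attribute could answer the first question, the predictors can be split into categorical and numeric names. The variable names `classes`, `predictors`, `categorical` and `numeric_vars` are illustrative only, and the set of class labels treated as categorical is an assumption about what may appear in a given model:

```r
library(rpart)  # also makes the kyphosis data set available

fit <- rpart(Kyphosis ~ Age + Number + Start, data = kyphosis)

# dataClasses is a named character vector: variable name -> class label
classes <- attr(fit$terms, "dataClasses")

# Drop the response so that only the predictors remain
resp <- attr(fit$terms, "response")
predictors <- classes[-resp]

# Assumption: these labels are the ones that mark a categorical predictor
categorical  <- names(predictors)[predictors %in% c("factor", "ordered", "character", "logical")]
numeric_vars <- names(predictors)[predictors == "numeric"]

categorical
numeric_vars
```

For this particular fit every predictor is numeric, so `categorical` comes back empty.
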
That example has no csplit component in its structure, because none of the predictor variables is a factor. You can easily make one:

> fit <- rpart::rpart(Kyphosis ~ Age + factor(findInterval(Number,c(0,4,6,Inf))) + Start, data = kyphosis)
> fit$csplit
     [,1] [,2] [,3]
[1,]    1    1    3
[2,]    1    1    3
[3,]    3    1    3
[4,]    1    3    3
[5,]    3    1    3
[6,]    3    3    1
[7,]    3    1    3
[8,]    1    1    3
> attr(fit$terms, "dataClasses")
                                     Kyphosis 
                                     "factor" 
                                          Age 
                                    "numeric" 
factor(findInterval(Number, c(0, 4, 6, Inf))) 
                                     "factor" 
                                        Start 
                                    "numeric" 

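The integers are just the values of the factor variable, so the "mapping" is the same as the mapping from the as.numeric() codes of a factor to its levels(). If I were trying to build a character-matrix version of the fit$csplit matrix that substitutes the level names of the factor variable, this would be one path to success: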

> kyphosis$Numlev <- factor(findInterval(kyphosis$Number, c(0, 4, 6, Inf)), labels=c("low","med","high"))
> str(kyphosis)
'data.frame':   81 obs. of  5 variables:
 $ Kyphosis: Factor w/ 2 levels "absent","present": 1 1 2 1 1 1 1 1 1 2 ...
 $ Age     : int  71 158 128 2 1 1 61 37 113 59 ...
 $ Number  : int  3 3 4 5 4 2 2 3 2 6 ...
 $ Start   : int  5 14 5 1 15 16 17 16 16 12 ...
 $ Numlev  : Factor w/ 3 levels "low","med","high": 1 1 2 2 2 1 1 1 1 3 ...
> fit <- rpart::rpart(Kyphosis ~ Age +Numlev + Start, data = kyphosis)
> Levels <- fit$csplit
> Levels[] <- levels(kyphosis$Numlev)[Levels]
> Levels
     [,1]   [,2]   [,3]  
[1,] "low"  "low"  "high"
[2,] "low"  "low"  "high"
[3,] "high" "low"  "high"
[4,] "low"  "high" "high"
[5,] "high" "low"  "high"
[6,] "high" "high" "low" 
[7,] "high" "low"  "high"
[8,] "low"  "low"  "high"
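One caveat worth checking against ?rpart.object: it describes csplit as having one column per factor level, with entries 1 (that level goes left), 2 (that level is not present at the node) and 3 (that level goes right). Under that reading, an alternative sketch labels the columns with the level names and decodes the entries as directions. The names `kyph`, `lev`, `direction` and `readable` are illustrative, and the code assumes the model contains a single factor predictor so that every row of csplit refers to Numlev:

```r
library(rpart)

kyph <- kyphosis  # work on a copy of the data
kyph$Numlev <- factor(findInterval(kyph$Number, c(0, 4, 6, Inf)),
                      labels = c("low", "med", "high"))
fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = kyph)

# Level names are stored with the model itself
lev <- attr(fit, "xlevels")$Numlev

# Direction codes 1, 2, 3 as described in ?rpart.object
direction <- c("left", "absent", "right")

readable <- fit$csplit
readable[] <- direction[fit$csplit]  # keep the matrix shape, swap codes for words
colnames(readable) <- lev            # column j corresponds to level j

readable
```

With several factor predictors, the rows of csplit belong to different variables (the index column of fit$splits says which row each split uses), so the column labels would have to be chosen per row.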

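Response to comment: if you only have the model, use str() to look at it. In the example I created, I see an "ordered" leaf, and the factor labels are stored in an attribute named "xlevels":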

$ ordered            : Named logi [1:3] FALSE FALSE FALSE
  ..- attr(*, "names")= chr [1:3] "Age" "Numlev" "Start"
 - attr(*, "xlevels")=List of 1
  ..$ Numlev: chr [1:3] "low" "med" "high"
 - attr(*, "ylevels")= chr [1:2] "absent" "present"
 - attr(*, "class")= chr "rpart"
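A short sketch of recovering the name-to-integer mapping from a saved model alone, assuming the model was written with saveRDS(). The temporary file path and the names `saved`, `xlev` and `codes` are hypothetical stand-ins for your own model file and variables:

```r
library(rpart)

# Build and save a model so that the example is self-contained
kyph <- kyphosis
kyph$Numlev <- factor(findInterval(kyph$Number, c(0, 4, 6, Inf)),
                      labels = c("low", "med", "high"))
fit <- rpart(Kyphosis ~ Age + Numlev + Start, data = kyph)

path <- tempfile(fileext = ".rds")  # hypothetical location of the model file
saveRDS(fit, path)
rm(fit, kyph)                       # from here on, only the file is available

saved <- readRDS(path)

# Level names for every factor predictor, in level order
xlev <- attr(saved, "xlevels")

# Integer code of each level: position in the levels vector
codes <- lapply(xlev, function(l) setNames(seq_along(l), l))
codes$Numlev
```

The position of a name in the levels vector is the integer that as.numeric() would return for that level, which is the mapping asked about in the question.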
Answered 2015-04-05T16:34:04.510