r - Testing rules generated by the rpart package

Question

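I want to programmatically test a single rule generated from a tree. In a tree, the path between the root and a leaf (terminal node) can be interpreted as a rule.

In R, we can use the rpart package and do the following (in this post I will use the iris dataset, for illustration purposes only):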

library(rpart)
model <- rpart(Species ~ ., data=iris)

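With these two lines I get a tree called model, whose class is rpart (documented under rpart.object; rpart documentation, page 21). This object carries a lot of information and supports a variety of methods. In particular, it has a frame variable (accessible in the standard way: model$frame) (ibid.), and there is the function path.rpart (rpart documentation, page 7), which gives you the path from the root node to the node of interest (the nodes argument of the function).

The row.names of the frame variable contain the node numbers of the tree. The var column gives the split variable at each node, yval the fitted value, and yval2 the class counts and probabilities, among other information.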

> model$frame
           var   n  wt dev yval complexity ncompete nsurrogate     yval2.1     yval2.2     yval2.3     yval2.4     yval2.5     yval2.6     yval2.7
1 Petal.Length 150 150 100    1       0.50        3          3  1.00000000 50.00000000 50.00000000 50.00000000  0.33333333  0.33333333  0.33333333
2       <leaf>  50  50   0    1       0.01        0          0  1.00000000 50.00000000  0.00000000  0.00000000  1.00000000  0.00000000  0.00000000
3  Petal.Width 100 100  50    2       0.44        3          3  2.00000000  0.00000000 50.00000000 50.00000000  0.00000000  0.50000000  0.50000000
6       <leaf>  54  54   5    2       0.00        0          0  2.00000000  0.00000000 49.00000000  5.00000000  0.00000000  0.90740741  0.09259259
7       <leaf>  46  46   1    3       0.01        0          0  3.00000000  0.00000000  1.00000000 45.00000000  0.00000000  0.02173913  0.97826087

However, only the rows marked <leaf> in the var column are terminal nodes (leaves). In this case those are nodes 2, 6 and 7.
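The leaf node numbers can also be read off the frame in code rather than by eye. A minimal sketch, assuming the same model as above; the names leaf.rows and leaf.nodes are only illustrative:

```r
library(rpart)

model <- rpart(Species ~ ., data = iris)

# Rows of the frame whose split variable is "<leaf>" are terminal nodes
leaf.rows <- which(model$frame$var == "<leaf>")

# The row names of the frame hold the node numbers
leaf.nodes <- as.integer(row.names(model$frame))[leaf.rows]
leaf.nodes
```

For the frame printed above this should give nodes 2, 6 and 7.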

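As mentioned above, you can use path.rpart to extract a rule (this technique is used in the rattle package and in the article Sharma Credit Score), as shown further below.

In addition, the model keeps the levels of the predicted variable in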

predicted.levels <- attr(model, "ylevels")

These levels correspond to the yval column of model$frame.

For the leaf with node number 7 (row number 5), the predicted value is

> predicted.levels[model$frame[5, ]$yval]
[1] "virginica"
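The row number 5 is hard-coded above. A small sketch of how the row could be looked up from the node number instead, assuming the same model; node and row are illustrative names:

```r
library(rpart)

model <- rpart(Species ~ ., data = iris)
predicted.levels <- attr(model, "ylevels")

node <- 7
# Node numbers are stored as the row names of the frame
row <- match(as.character(node), row.names(model$frame))

predicted.levels[model$frame$yval[row]]
```

This avoids counting rows by hand when the tree has many nodes.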

The rule is

> rule <- path.rpart(model, nodes = 7)

 node number: 7 
   root
   Petal.Length>=2.45
   Petal.Width>=1.75

So the rule can be read as

If Petal.Length >= 2.45 AND Petal.Width >= 1.75 THEN Species = Virginica

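I know that I can test how many true positives this rule has (on a test dataset; here I will use the iris dataset again) by subsetting the new dataset as follows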

> hits <- subset(iris, Petal.Length >= 2.45 & Petal.Width >= 1.75)

and then computing the confusion matrix

> table(hits$Species, hits$Species == "virginica")

             FALSE TRUE
  setosa         0    0
  versicolor     1    0
  virginica      0   45

(Note: I used the same iris dataset for testing)
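The counts in that table can be condensed into two numbers for the rule. A rough sketch, where precision and recall are my own labels for "share of hits that are virginica" and "share of all virginica rows that the rule captures":

```r
# Rows that satisfy the rule for node 7
hits <- subset(iris, Petal.Length >= 2.45 & Petal.Width >= 1.75)

# Share of the rows matched by the rule that really are virginica
precision <- mean(hits$Species == "virginica")

# Share of all virginica rows that the rule captures
recall <- sum(hits$Species == "virginica") / sum(iris$Species == "virginica")

c(precision = precision, recall = recall)
```

With the table above, precision works out to 45/46 and recall to 45/50.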

How can I evaluate the rule programmatically? I can extract the conditions from the rule as follows

> unlist(rule, use.names = FALSE)[-1]
[1] "Petal.Length>=2.45" "Petal.Width>=1.75"

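But how can I continue from here? I cannot pass these strings to the subset function.

Thanks in advance

Note: this question has been heavily edited for clarity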

score 3 · Accepted Answer

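I was able to solve this in the following way.

Disclaimer: there must obviously be a better way to solve this, but this hack works and does what I want... (I'm not proud of it... it is hackish, but it works)

OK, let's start. Basically the idea is to use the sqldf package.

If you check the question, the last piece of code puts each segment of the tree path into a list. So I will start from there.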

        library(sqldf)
        library(stringr)

        # Transform the rule into a character vector and drop the "root" entry
        rule.v <- unlist(rule, use.names=FALSE)[-1]
        # Replace the dots in column names with underscores, since sqldf doesn't handle dots in names.
        # (This assumes a sqldf version that exposes iris columns as Petal_Length etc.;
        # versions that keep the dots would need quoted column names instead.)
        rule.v <- str_replace_all(rule.v, pattern="([a-zA-Z])\\.([a-zA-Z])", replacement="\\1_\\2")
        # Turn a categorical condition "var=a,b" into the start of "var in ('a','b')";
        # ">=" is not affected because "=" must follow a letter or digit
        rule.v <- str_replace_all(rule.v, pattern="([a-zA-Z0-9])=", replacement="\\1 in ('")
        # Wrap every value in the list of values in single quotes
        # The last element can't be closed in this way (any ideas?)
        rule.v <- str_replace_all(rule.v, pattern=",", replacement="','")

        # Close the last element with an apostrophe and a ")"
        for (i in which(!is.na(str_extract(pattern="in", string=rule.v)))) {
          rule.v[i] <- paste(append(rule.v[i], "')"), collapse="")
        }

        # Collapse all the conditions into one string joined by " AND "
        rule.v <- paste(rule.v, collapse = " AND ")

        # Generate the query
        # Use any metric that you can get from the data frame
        query <- paste("SELECT Species, count(Species) FROM iris WHERE ", rule.v, " group by Species", sep="")

        # For debug only...
        print(query)

        # Execute and print the results
        print(sqldf(query))

That's it!

I warned you, it is hackish...

Hope this helps someone else...

Thanks for all the help and suggestions!
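The same string rewriting can be sketched without SQL, by turning a categorical condition into an R %in% test instead of an SQL in (...) list. This is only an illustration: rule_to_r is a hypothetical helper, the "Type=Compact,Large" condition is a made-up example of the var=a,b format, and level names that themselves contain commas would break it.

```r
# Hypothetical helper: rewrite path.rpart() conditions as R expressions.
# Numeric conditions ("x>=1.5", "x< 1.5") are already valid R and pass through.
rule_to_r <- function(conds) {
  # A categorical condition has "=" before any "<" or ">" appears
  is_cat <- grepl("^[^<>=]+=", conds)
  vapply(seq_along(conds), function(i) {
    if (!is_cat[i]) return(conds[i])
    # Split at the first "=" into variable name and comma-separated levels
    parts <- regmatches(conds[i], regexpr("=", conds[i]), invert = TRUE)[[1]]
    lev   <- strsplit(parts[2], ",", fixed = TRUE)[[1]]
    sprintf("%s %%in%% c(%s)", parts[1],
            paste0('"', lev, '"', collapse = ", "))
  }, character(1))
}

converted <- rule_to_r(c("Petal.Length>=2.45", "Type=Compact,Large"))
converted
```

The resulting strings can then be joined with " & " and evaluated against a data frame, which avoids renaming columns for SQL.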

score 2 · Accepted Answer

In general I would not recommend using eval(parse(...)), but in this case it seems to work:

Extract the rule:

rule <- unname(unlist(path.rpart(model, nodes=7)))[-1]

 node number: 7 
   root
   Petal.Length>=2.45
   Petal.Width>=1.75
rule
[1] "Petal.Length>=2.45" "Petal.Width>=1.75"

Use the rule to extract the data:

node_data <- with(iris, iris[eval(parse(text=paste(rule, collapse=" & "))), ])
head(node_data)

    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species
71           5.9         3.2          4.8         1.8 versicolor
101          6.3         3.3          6.0         2.5  virginica
102          5.8         2.7          5.1         1.9  virginica
103          7.1         3.0          5.9         2.1  virginica
104          6.3         2.9          5.6         1.8  virginica
105          6.5         3.0          5.8         2.2  virginica
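The two steps above could be wrapped into one function so that any node can be tested against any data frame. A minimal sketch: rows_for_node is a hypothetical name, it only handles numeric splits, and since the thresholds printed by path.rpart may be rounded, rows lying very close to a split could in principle be assigned differently than by the tree itself.

```r
library(rpart)

model <- rpart(Species ~ ., data = iris)

# Hypothetical helper: rows of `data` that satisfy the rule leading to `node`
rows_for_node <- function(model, node, data) {
  # print.it = FALSE suppresses the printed path; [-1] drops the "root" entry
  conds <- path.rpart(model, nodes = node, print.it = FALSE)[[1]][-1]
  if (length(conds) == 0) return(data)  # the root node has no conditions
  # Evaluate the combined condition with the columns of `data` in scope
  keep <- eval(parse(text = paste(conds, collapse = " & ")), envir = data)
  data[which(keep), , drop = FALSE]     # which() drops rows where keep is NA
}

hits <- rows_for_node(model, 7, iris)
table(hits$Species)
```

As with any eval(parse(...)) construction, this should only be run on rules that come from a tree you trust.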

score 1 · Accepted Answer

Starting from...

Rule number: 16 [yval=bad cover=220 N=121 Y=99 (37%) prob=0.04]
checking< 2.5
afford< 54
history< 3.5
coapp< 2.5

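You would have a "probability" vector that starts out as all zeros, which you can update using rule 16:

prob <- ifelse( dat[['checking']] < 2.5 &
                dat[['afford']]   < 54  &
                dat[['history']]  < 3.5 &
                dat[['coapp']]    < 2.5, 0.04, prob )

You would then need to run through all the other rules (which should not change any of the probabilities for the cases matched here, since the rules of a tree should be disjoint). There may be more efficient ways of building a prediction than this. For example... the predict.rpart function.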
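As a sketch of that more direct route on the iris tree from the question: the fitted object records, in model$where, the frame row of the leaf each training observation fell into, and predict() applies the whole tree to new data. The names leaf.node and pred are illustrative.

```r
library(rpart)

model <- rpart(Species ~ ., data = iris)

# model$where holds frame row numbers; map them to node numbers
leaf.node <- as.integer(row.names(model$frame))[model$where]

# Cross-tabulate leaf membership against the true class
table(leaf.node, iris$Species)

# Predicted class for each row of a (new) data frame
pred <- predict(model, newdata = iris, type = "class")
table(pred, iris$Species)
```

The row for node 7 in the first table corresponds to the rule tested in the question, without any string parsing.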
