
I have created several ctree models (about 40 to 80) that I want to evaluate rather often.

One problem is that the model objects are very large (40 models take more than 2.8G of memory), and it looks to me like they store the training data, probably as modelname@data and modelname@responses, and not just the information relevant for predicting new data.
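For reference, the in-memory footprint of a single model can be checked with object.size (modelname stands in for one of my fitted models):

print(object.size(modelname), units = "Mb")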

Most other R learning packages have configurable options for whether to include the data in the model object, but I couldn't find any hint of that in the documentation. I also tried assigning an empty ModelEnv object via

modelname@data <- new("ModelEnv")

but that had no effect on the size of the respective RData file.

Does anyone know whether ctree really stores the training data, and how to remove from ctree models all data that is irrelevant for new predictions, so that I can fit many of them in memory?

Thanks a lot,

Stefan


Thanks for the feedback, which was already very helpful.

I used dput and str to take a deeper look at the objects and found that no training data is included in the model, but there is a responses slot, which apparently contains the training labels and row names. Anyhow, I noticed that each node holds a weight vector for each training sample. After inspecting the code for a while, I googled a bit and found the following comment in the party NEWS log:

         CHANGES IN party VERSION 0.9-13 (2007-07-23)

o   update `mvt.f'

o   improve the memory footprint of RandomForest objects
    substancially (by removing the weights slots from each node).

It turns out that the party package contains a C function to remove these weights, reachable via a call to R_remove_weights, defined as follows:

SEXP R_remove_weights(SEXP subtree, SEXP removestats) {
    C_remove_weights(subtree, LOGICAL(removestats)[0]);
    return(R_NilValue);
}

And it works just fine:

# cc is my model object

sum(unlist(lapply(slotNames(cc), function (x)  object.size(slot(cc, x)))))
# returns: [1] 2521256
save(cc, file="cc_before.RData")

.Call("R_remove_weights", cc@tree, TRUE, PACKAGE="party")
# returns NULL and removes weights and node statistics

sum(unlist(lapply(slotNames(cc), function (x)  object.size(slot(cc, x)))))
# returns: [1] 1521392
save(cc, file="cc_after.RData")

As you can see, this reduces the object size substantially, from about 2.5MB down to 1.5MB.

Strangely, though, the corresponding RData files are insanely huge, and the call has no effect on them at all:

$ ls -lh cc*
-rw-r--r-- 1 user user 9.6M Aug 24 15:44 cc_after.RData
-rw-r--r-- 1 user user 9.6M Aug 24 15:43 cc_before.RData

Gunzipping the file shows the 2.5MB object to take up nearly 100MB of space:

$ cp cc_before.RData cc_before.gz
$ gunzip cc_before.gz 
$ ls -lh cc_before*
-rw-r--r-- 1 user user  98M Aug 24 15:45 cc_before
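
This can also be cross-checked from within R without writing a file; a sketch, using the model object cc from above:

## serialize the object to a raw vector; its length corresponds to the
## uncompressed size that save() compresses into the RData file
raw.size <- length(serialize(cc, connection = NULL))
raw.size / 1024^2  # size in Mb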

Any ideas what could cause this?


2 Answers


I found a solution to the problem at hand, so I'm writing this answer in case anyone else runs into the same issue. I'll describe my process, so it might be a bit rambling; bear with me.

Having no clue, I thought about nuking slots and removing weights to get the objects as small as possible and at least save some memory, in case no fix could be found. So I removed @data and @responses as a start, and prediction still worked fine without them, yet there was no effect on the .RData file size.
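
For reference, the nuking looked roughly like this (a sketch; new("VariableFrame") as an empty replacement for @responses is an assumption based on the slot's declared class and may need adjusting):

> c1@data <- new("ModelEnv")            ## as already tried in the question
> c1@responses <- new("VariableFrame")  ## assumed empty replacement for the responses slot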

I then went the other way round and created an empty ctree model, just plugging the tree into it:

> library(party)

## create reference predictions for the dataset
> predictions.org <- treeresponse(c1, d)

## save tree object for reference
save(c1, file="testSize_c1.RData")

Checking the size of the original object:

$ ls -lh testSize_c1.RData 
-rw-r--r-- 1 user user 9.6M 2011-08-25 14:35 testSize_c1.RData

Now, let's create an empty CTree and copy the tree only:

## extract the tree only 
> c1Tree <- c1@tree

## create empty tree and plug in the extracted one 
> newCTree <- new("BinaryTree")
> newCTree@tree <- c1Tree

## save tree for reference 
save(newCTree, file="testSize_newCTree.RData")

This new tree object is now much smaller:

$ ls -lh testSize_newCTree.RData 
-rw-r--r-- 1 user user 108K 2011-08-25 14:35 testSize_newCTree.RData

However, it can't be used to predict:

## predict with the new tree
> predictions.new <- treeresponse(newCTree, d)
Error in object@cond_distr_response(newdata = newdata, ...) : 
  unused argument(s) (newdata = newdata)

We did not set @cond_distr_response, which might be causing the error, so we copy the original one over as well and try to predict again:

## extract cond_distr_response from original tree
> cdr <- c1@cond_distr_response
> newCTree@cond_distr_response <- cdr

## save tree for reference 
save(newCTree, file="testSize_newCTree_with_cdr.RData")

## predict with the new tree
> predictions.new <- treeresponse(newCTree, d)

## check correctness
> identical(predictions.org, predictions.new)
[1] TRUE

This works perfectly, but now the size of the RData file is back at its original value:

$ ls -lh testSize_newCTree_with_cdr.RData 
-rw-r--r-- 1 user user 9.6M 2011-08-25 14:37 testSize_newCTree_with_cdr.RData

Simply printing the slot shows it to be a function bound to an environment:

> c1@cond_distr_response
function (newdata = NULL, mincriterion = 0, ...) 
{
    wh <- RET@get_where(newdata = newdata, mincriterion = mincriterion)
    response <- object@responses
    if (any(response@is_censored)) {
        swh <- sort(unique(wh))
        RET <- vector(mode = "list", length = length(wh))
        resp <- response@variables[[1]]
        for (i in 1:length(swh)) {
            w <- weights * (where == swh[i])
            RET[wh == swh[i]] <- list(mysurvfit(resp, weights = w))
        }
        return(RET)
    }
    RET <- .Call("R_getpredictions", tree, wh, PACKAGE = "party")
    return(RET)
}
<environment: 0x44e8090>

So the answer to the initial question appears to be that the methods of the object bind an environment to it, which is then saved with the object in the corresponding RData file. This might also explain why several packages are loaded when the RData file is read.
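
A minimal, party-independent illustration of the effect (a sketch; all names are made up): a closure drags its whole enclosing environment into the RData file, even objects it never uses.

make_f <- function() {
    big <- rnorm(1e6)   ## roughly 8MB, captured in the closure's environment
    function(x) x + 1   ## never touches 'big'
}
f <- make_f()
save(f, file = "closure_demo.RData")  ## file is large: 'big' is serialized along with f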

Thus, to get rid of the environment, we can't copy the methods, but we can't predict without them either. The rather "dirty" solution is to emulate the functionality of the original methods and call the underlying C code directly. After some digging through the source code, this is indeed possible. As the code copied above suggests, we need to call get_where, which determines the terminal node of the tree reached by the input. We then need to call R_getpredictions to determine the response from that terminal node for each input sample. The tricky part is that we need to get the data in the right input format and thus have to call the data preprocessing included in ctree:

## create a character string of the formula which was used to fit the tree
## (there might be a neater way to do this)
> library(stringr)
> org.formula <- str_c(
                   do.call(str_c, as.list(deparse(c1@data@formula$response[[2]]))),
                   "~", 
                   do.call(str_c, as.list(deparse(c1@data@formula$input[[2]]))))

## call the internal ctree preprocessing 
> data.dpp <- party:::ctreedpp(as.formula(org.formula), d)

## create the data object necessary for the ctree C code
> data.ivf <- party:::initVariableFrame.df(data.dpp@menv@get("input"), 
                                           trafo = ptrafo)

## now call the tree traversal routine, note that it only requires the tree
## extracted from the @tree slot, not the whole object
> nodeID <- .Call("R_get_nodeID", c1Tree, data.ivf, 0, PACKAGE = "party")

## now determine the respective responses
> predictions.syn <- .Call("R_getpredictions", c1Tree, nodeID, PACKAGE = "party")

## check correctness
> identical(predictions.org, predictions.syn)
[1] TRUE

We now only need to save the extracted tree and the formula string to be able to predict new data:

> save(c1Tree, org.formula, file="testSize_extractedObjects.RData")

We can further remove the unnecessary weights as described in the updated question above:

> .Call("R_remove_weights", c1Tree, TRUE, PACKAGE="party")
> save(c1Tree, org.formula, file="testSize_extractedObjects__removedWeights.RData")

Now let's have a look at the file sizes again:

$ ls -lh testSize_extractedObjects*
-rw-r--r-- 1 user user 109K 2011-08-25 15:31 testSize_extractedObjects.RData
-rw-r--r-- 1 user user  43K 2011-08-25 15:31 testSize_extractedObjects__removedWeights.RData

Finally, instead of (compressed) 9.6M, only 43K are required to use the model. I should now be able to fit as many as I want in my 3G heap space. Hooray!
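
For repeated use, the steps above can be wrapped in a small helper; a sketch that relies on the same internal party functions and C entry points demonstrated above (assumes library(party) is loaded):

predict_slim <- function(tree, formula.string, newdata) {
    ## preprocess newdata exactly as ctree would
    dpp <- party:::ctreedpp(as.formula(formula.string), newdata)
    ivf <- party:::initVariableFrame.df(dpp@menv@get("input"), trafo = ptrafo)
    ## traverse the tree and look up the terminal-node responses
    nodeID <- .Call("R_get_nodeID", tree, ivf, 0, PACKAGE = "party")
    .Call("R_getpredictions", tree, nodeID, PACKAGE = "party")
}

## usage: predictions <- predict_slim(c1Tree, org.formula, d)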

Answered 2011-08-25T13:42:27.860

What you're looking for is removing slots. A word of caution: this can be rather dangerous, given how the party functions work with the objects.

Nonetheless, take a look at slotNames(yourModel). You can also try object.size(slot(yourModel, slotNameOfInterest)) to examine the size of the different slots. You could easily create a sorted table to determine the sizes of the objects in each slot; a sketch follows below.
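
For example, a sketch of that sorted overview:

slot.sizes <- sapply(slotNames(yourModel),
                     function(s) object.size(slot(yourModel, s)))
sort(slot.sizes, decreasing = TRUE)  ## largest slots first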

In any case, the data slot is a ModelEnvFormula object (I'll call it "MEF"). You can create a dummy MEF with dummyMEF <- ModelEnvFormula(1 ~ 1) and then assign it to the data slot: slot(yourModel, "data") <- dummyMEF

That will nuke that particular slot. You should take a look at whether any other slots are causing storage headaches - the object.size() function will help. I agree that it would be nice to be able to omit the training data from model objects.

Answered 2011-08-22T16:05:47.473