I found a solution to the problem at hand, so I am writing this answer in case anyone runs into the same issue. I'll describe my process, so it may ramble a bit; bear with me.
With no clue where to start, I thought about nuking slots and removing weights to get the objects as small as possible and at least save some memory in case no fix could be found. So I removed @data and @responses as a start; prediction still worked fine without them, but there was no effect on the .RData file size.
I then went the other way round and created an empty ctree model, just plugging the tree into it:
> library(party)
## create reference predictions for the dataset
> predictions.org <- treeresponse(c1, d)
## save tree object for reference
save(c1, file="testSize_c1.RData")
Checking the size of the original object:
$ ls -lh testSize_c1.RData
-rw-r--r-- 1 user user 9.6M 2011-08-25 14:35 testSize_c1.RData
Now, let's create an empty CTree and copy the tree only:
## extract the tree only
> c1Tree <- c1@tree
## create empty tree and plug in the extracted one
> newCTree <- new("BinaryTree")
> newCTree@tree <- c1Tree
## save tree for reference
save(newCTree, file="testSize_newCTree.RData")
This new tree object is now much smaller:
$ ls -lh testSize_newCTree.RData
-rw-r--r-- 1 user user 108K 2011-08-25 14:35 testSize_newCTree.RData
However, it can't be used to predict:
## predict with the new tree
> predictions.new <- treeresponse(newCTree, d)
Error in object@cond_distr_response(newdata = newdata, ...) :
unused argument(s) (newdata = newdata)
We did not set the @cond_distr_response slot, which might cause the error, so let's copy the original one as well and try to predict again:
## extract cond_distr_response from original tree
> cdr <- c1@cond_distr_response
> newCTree@cond_distr_response <- cdr
## save tree for reference
save(newCTree, file="testSize_newCTree_with_cdr.RData")
## predict with the new tree
> predictions.new <- treeresponse(newCTree, d)
## check correctness
> identical(predictions.org, predictions.new)
[1] TRUE
This works perfectly, but now the size of the RData file is back at its original value:
$ ls -lh testSize_newCTree_with_cdr.RData
-rw-r--r-- 1 user user 9.6M 2011-08-25 14:37 testSize_newCTree_with_cdr.RData
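Rather than copying slots one at a time and checking the file size after each step, the serialized size of every slot can be measured directly. The following is a minimal sketch on a toy S4 class; ToyModel, make_predictor and slot_sizes are hypothetical names made up for illustration and are not part of party. The same helper should work on any S4 object, since it only relies on slotNames(), slot() and serialize().

```r
library(methods)

# Toy S4 class standing in for a fitted model object (hypothetical, not from party)
setClass("ToyModel", representation(tree = "list", predict_fun = "function"))

# Returns a function whose enclosing environment holds a large vector
make_predictor <- function() {
  training_data <- rnorm(1e5)   # captured by the closure's environment
  function(x) x
}

toy <- new("ToyModel", tree = list(1, 2, 3), predict_fun = make_predictor())

# Serialized size in bytes of every slot of an S4 object
slot_sizes <- function(obj) {
  vapply(slotNames(obj),
         function(s) as.numeric(length(serialize(slot(obj, s), NULL))),
         numeric(1))
}

sizes <- slot_sizes(toy)
print(sort(sizes, decreasing = TRUE))
```

In this toy case the function-valued slot dominates, although the function body itself is trivial.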
Simply printing the slot shows it to be a function bound to an environment:
> c1@cond_distr_response
function (newdata = NULL, mincriterion = 0, ...)
{
    wh <- RET@get_where(newdata = newdata, mincriterion = mincriterion)
    response <- object@responses
    if (any(response@is_censored)) {
        swh <- sort(unique(wh))
        RET <- vector(mode = "list", length = length(wh))
        resp <- response@variables[[1]]
        for (i in 1:length(swh)) {
            w <- weights * (where == swh[i])
            RET[wh == swh[i]] <- list(mysurvfit(resp, weights = w))
        }
        return(RET)
    }
    RET <- .Call("R_getpredictions", tree, wh, PACKAGE = "party")
    return(RET)
}
<environment: 0x44e8090>
So the answer to the initial question appears to be that the "methods" of the object are functions stored in slots, each of which carries the environment it was created in, and that environment is then saved with the object in the corresponding RData file. This might also explain why several packages are loaded when the RData file is read.
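The effect is easy to reproduce in plain R without party. The following toy example (make_closure, f and g are made-up names) shows that a function created inside another function drags the whole enclosing environment along when it is serialized, even if it never uses the large object in it:

```r
make_closure <- function() {
  big <- rnorm(1e6)   # roughly 8 MB that the returned function never uses
  function(x) x + 1
}

f <- make_closure()
ls(environment(f))    # the closure's environment still holds "big"
size_with_env <- length(serialize(f, NULL))

# Re-home a copy of the function in the global environment, which is
# serialized as a reference rather than by content
g <- f
environment(g) <- globalenv()
size_without_env <- length(serialize(g, NULL))

c(with_env = size_with_env, without_env = size_without_env)
```

Swapping the environment only works here because the toy function needs nothing from it. The cond_distr_response function printed above refers to RET, object, weights, where and tree from its environment, so the same trick would break it.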
Thus, to get rid of the environment, we can't copy the methods, but we can't predict without them either. The rather "dirty" solution is to emulate the functionality of the original methods and call the underlying C code directly. After some digging through the source code, this is indeed possible. As the code copied above suggests, we need the functionality of get_where, which determines the terminal node of the tree reached by the input. We then need to call R_getpredictions to determine the response from that terminal node for each input sample. The tricky part is that we need to get the data into the right input format and thus have to call the data preprocessing included in ctree:
## create a character string of the formula which was used to fit the tree
## (there might be a neater way to do this)
> library(stringr)
> org.formula <- str_c(
    do.call(str_c, as.list(deparse(c1@data@formula$response[[2]]))),
    "~",
    do.call(str_c, as.list(deparse(c1@data@formula$input[[2]]))))
## call the internal ctree preprocessing
> data.dpp <- party:::ctreedpp(as.formula(org.formula), d)
## create the data object necessary for the ctree C code
> data.ivf <- party:::initVariableFrame.df(data.dpp@menv@get("input"),
                                           trafo = ptrafo)
## now call the tree traversal routine, note that it only requires the tree
## extracted from the @tree slot, not the whole object
> nodeID <- .Call("R_get_nodeID", c1Tree, data.ivf, 0, PACKAGE = "party")
## now determine the respective responses
> predictions.syn <- .Call("R_getpredictions", c1Tree, nodeID, PACKAGE = "party")
## check correctness
> identical(predictions.org, predictions.syn)
[1] TRUE
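As for a neater way to build the formula string: base R can do it without stringr. This is a sketch on toy stand-ins, assuming that the response and input entries are one-sided formulas, as the [[2]] indexing above suggests. The names response_part, input_part and collapse_expr are made up for illustration.

```r
# Toy stand-ins for c1@data@formula$response and c1@data@formula$input
# (hypothetical values, assumed to be one-sided formulas)
response_part <- ~ y
input_part <- ~ x1 + x2 + log(x3)

# deparse() may return several strings for long expressions, so collapse them
collapse_expr <- function(e) paste(deparse(e), collapse = "")

formula_string <- paste(collapse_expr(response_part[[2]]),
                        "~",
                        collapse_expr(input_part[[2]]))
formula_string
as.formula(formula_string)
```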
We now only need to save the extracted tree and the formula string to be able to predict new data:
> save(c1Tree, org.formula, file="testSize_extractedObjects.RData")
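For later use, the prediction steps can be bundled into one small function. This is only a sketch: predict_from_tree is a made-up name, and it relies on the same unexported party internals and C entry points as the calls above, which may differ between package versions.

```r
# Sketch only: wraps the steps shown above; depends on party internals
# (ctreedpp, initVariableFrame.df, R_get_nodeID, R_getpredictions)
predict_from_tree <- function(tree, formula_string, newdata) {
  if (!requireNamespace("party", quietly = TRUE)) {
    stop("package 'party' is required")
  }
  # run the ctree preprocessing on the new data
  data_dpp <- party:::ctreedpp(as.formula(formula_string), newdata)
  # build the data object expected by the C code
  data_ivf <- party:::initVariableFrame.df(data_dpp@menv@get("input"),
                                           trafo = party:::ptrafo)
  # find the terminal node for every sample, then look up its response
  node_id <- .Call("R_get_nodeID", tree, data_ivf, 0, PACKAGE = "party")
  .Call("R_getpredictions", tree, node_id, PACKAGE = "party")
}

# Intended use in a fresh session (not run here):
#   load("testSize_extractedObjects.RData")
#   predictions <- predict_from_tree(c1Tree, org.formula, d)
```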
We can further remove the unnecessary weights as described in the updated question above:
> .Call("R_remove_weights", c1Tree, TRUE, PACKAGE="party")
> save(c1Tree, org.formula, file="testSize_extractedObjects__removedWeights.RData")
Now let's have a look at the file sizes again:
$ ls -lh testSize_extractedObjects*
-rw-r--r-- 1 user user 109K 2011-08-25 15:31 testSize_extractedObjects.RData
-rw-r--r-- 1 user user 43K 2011-08-25 15:31 testSize_extractedObjects__removedWeights.RData
Finally, instead of (compressed) 9.6M, only 43K are required to use the model. I should now be able to fit as many as I want in my 3G heap space. Hooray!