
I am working on Random Forest classification.

I found that cforest in the "party" package usually performs better than randomForest.
However, it seems that cforest overfits easily.

A toy example

Here is a random data set with a binary factor response and 10 numeric explanatory variables, all generated from rnorm().

# Sorry for the redundant preparation.
data <- data.frame(response=rnorm(100))            # 100 random draws
data$response <- factor(data$response < 0)         # turn them into a binary factor
data <- cbind(data, matrix(rnorm(1000), ncol=10))  # 10 pure-noise predictors
colnames(data)[-1] <- paste("V",1:10,sep="")

Fit cforest with the unbiased parameter set (which seems to be the recommended setting).

library(party)
cf <- cforest(response ~ ., data=data, controls=cforest_unbiased())
table(predict(cf), data$response)
#       FALSE TRUE
# FALSE    45    7
# TRUE      6   42

Fairly good prediction performance on meaningless data.

On the other hand, randomForest gives an honest result.

library(randomForest)
rf <- randomForest(response ~ ., data=data)
table(predict(rf), data$response)
#       FALSE TRUE
# FALSE    25   27
# TRUE     26   22
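As a side note (not part of the original question), the honest-looking table above is itself an OOB result: for a randomForest object, predict() without newdata returns the out-of-bag predictions, while feeding the training data back in returns the in-bag fit. A quick sketch of the contrast (the exact counts will differ because the data are random):

table(predict(rf), data$response)                  # OOB predictions, the table shown above
table(predict(rf, newdata=data), data$response)    # in-bag fit; typically looks nearly perfect, like the cforest table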

Where do these differences come from?
I am afraid I am using cforest in the wrong way.

Let me add some extra observations on cforest:

  1. The number of variables did not affect the result much.
  2. The variable importance values (computed by varimp(cf)) were rather low compared to those obtained with realistic explanatory variables.
  3. The AUC of the ROC curve was nearly 1 (a sketch of one way to compute it follows this list).
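For reference, the near-perfect AUC in point 3 can be reproduced from the in-bag class probabilities. The sketch below uses the pROC package and party's treeresponse(); neither appears in the original question, and the exact AUC will vary with the random data.

library(pROC)
# per-observation probability of the second factor level (TRUE), taken in-bag
prob_true <- sapply(treeresponse(cf), function(p) p[2])
roc(data$response, prob_true)   # AUC should come out close to 1 on the training data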

I would appreciate your advice.

Additional note

Some wondered why the training data set was passed to predict().
I did not prepare a separate test set because I assumed the predictions were made on the OOB samples, which turned out not to be true for cforest.
cf. http://www.stat.berkeley.edu/~breiman/RandomForests/cc_home.htm


2 Answers


You cannot learn anything about the true performance of a classifier by studying its performance on the training set. Moreover, since there is no real pattern to find, you cannot really say whether it is worse to overfit, as cforest did, or to guess at random, as randomForest did. All you can say is that the two algorithms follow different strategies, but if you tested them on new, unseen data, both would likely fail.

The only way to estimate the performance of a classifier is to test it on external data that was not part of the training, in a situation where you know there is a pattern to find.
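To make this concrete, a minimal sketch of such a check might look like the following; the object names, sample size, and injected signal are illustrative only. Simulate data in which the response really does depend on V1, fit on one half, and evaluate on the held-out half.

# data with a real pattern: response depends on V1
n <- 200
x <- matrix(rnorm(n * 10), ncol = 10, dimnames = list(NULL, paste0("V", 1:10)))
sim <- data.frame(response = factor(x[, "V1"] + rnorm(n) > 0), x)

# split into training and test halves
test_idx <- sample(n, n / 2)
cf2 <- cforest(response ~ ., data = sim[-test_idx, ], controls = cforest_unbiased())

# performance on genuinely unseen rows is the number that matters
table(predict(cf2, newdata = sim[test_idx, ]), sim$response[test_idx])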

Some comments:

  1. The number of variables should not matter if none of them carry any useful information.
  2. It is nice to see that the variable importance is lower for meaningless data than for meaningful data. This can serve as a sanity check of the method, but probably not much more.
  3. AUC (or any other performance measure) on the training set does not mean much, since it is trivial to obtain a perfect classification there.
answered 2013-10-23T14:15:55.940

The predict methods have different defaults for cforest and randomForest models. party:::predict.RandomForest gets you

function (object, OOB = FALSE, ...) 
    {
        RandomForest@predict(object, OOB = OOB, ...)
    }

so

table(predict(cf), data$response)

gets me

        FALSE TRUE
  FALSE    45   13
  TRUE      7   35

However,

table(predict(cf, OOB=TRUE), data$response)

gets me

        FALSE TRUE
  FALSE    31   24
  TRUE     21   24

This is a rather dismal result.
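For a like-for-like comparison, that OOB table can be set next to randomForest's OOB table (which is what its default predict() already returns); on pure-noise data both should hover around chance. The dimension labels below are just illustrative:

table(cforest_OOB = predict(cf, OOB = TRUE), truth = data$response)
table(randomForest_OOB = predict(rf), truth = data$response)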

answered 2013-10-23T16:01:48.157