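I am trying to run a random forest model using the party package. My response variable (10 levels) is a categorical variable for different lake types (I am interested in which factors influence the clustering of lakes based on water-quality attributes). My predictors include both continuous and categorical variables. One categorical variable has 4 levels and the other has 8 levels (the US state in which the lake is located). Whenever I include the second categorical variable in the model, I get the following error: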
Error in model@fit(data, ...) : error code 1 from Lapack routine 'dgesdd'.
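I have been able to narrow it down to the fact that the cforest routine in the party package does not seem to run when a predictor has more than 4 categorical levels. I am not sure whether this applies to other datasets or is just a peculiarity of mine. Googling suggests that the error code may be related to a convergence problem. Does anyone know of a limit on the number of levels of a categorical predictor in the cforest routine (for example, randomForest from the randomForest package is limited to 32 levels)? I have not seen any explicit discussion of this for the party package. One solution would be to recode the factor as separate dummy variables, but I would like to avoid that. Given the characteristics of my data (correlated predictors, factors with differing numbers of levels, a mix of continuous and categorical data), cforest seems to be recommended over randomForest.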
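For reference, the dummy-variable recoding mentioned above could look roughly like the sketch below. It uses base R's model.matrix on a small made-up data frame; the data frame toy, its values, and the state codes are hypothetical stand-ins, not the real lake data.

```r
# Toy data standing in for the real predictors (all values are made up)
set.seed(1)
toy <- data.frame(
  pred1   = rnorm(16),
  factor2 = factor(rep(c("MI", "MN", "WI", "NY", "ME", "VT", "NH", "IA"), each = 2))
)

# One 0/1 indicator column per level; "- 1" drops the intercept so that
# no level is absorbed into a baseline
dummies <- model.matrix(~ factor2 - 1, data = toy)

# Replace the original factor with its indicator columns
toy.dummy <- cbind(toy["pred1"], as.data.frame(dummies))
```

Each indicator column only allows a one-level-versus-the-rest split, which is part of why I would rather keep the factor intact. I have not checked whether this recoding makes the error go away on the real data.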
Any insight would be greatly appreciated.
Link to a dummy dataset (the real data with only a limited number of variables): https://dl.dropboxusercontent.com/u/8554679/newdata.csv
library(RCurl)
library(party)
x = getURL("https://dl.dropboxusercontent.com/u/8554679/newdata.csv")
new.data = read.csv(text = x,header=TRUE)
new.data$response = as.factor(new.data$response)
new.data$factor1 = as.factor(new.data$factor1)
new.data$factor2 = as.factor(new.data$factor2)
set.seed(1123)
data.controls = cforest_unbiased(ntree=500, mtry=3)
data.cforest = cforest(response ~ factor1 + pred1 + pred2 + pred3 + pred4 + factor2 + pred5 + pred6 + pred7,data=new.data,controls=data.controls)
#executing this results in the following error: Error in model@fit(data, ...) : error code 1 from Lapack routine 'dgesdd'
#remove factor2 which has 8 levels from the formula
data.cforest = cforest(response ~ factor1 + pred1 + pred2 + pred3 + pred4 + pred5 + pred6 + pred7,data=new.data,controls=data.controls)
levels(new.data$factor2)
#arbitrarily reassign factor2 levels such that there are only 4 levels
#I've tried levels between 8 and 4 and it turns out it only works if factors have <=4 levels
random.rows = sample(x=c(1:nrow(new.data)),size=nrow(new.data),replace=FALSE)
new.data$factor2 = NA
new.data$factor2[random.rows[1:120]] = 1
new.data$factor2[random.rows[121:241]] = 2
new.data$factor2[random.rows[242:362]] = 3
new.data$factor2[random.rows[363:483]] = 4
new.data$factor2 = as.factor(new.data$factor2)
levels(new.data$factor2)
data.cforest = cforest(response ~ factor1 + pred1 + pred2 + pred3 + pred4 + factor2 + pred5 + pred6 + pred7,data=new.data,controls=data.controls)
#model runs fine.
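The random reassignment above changes two things at once: it reduces the number of levels, and it breaks any relationship between factor2 and the response. A possibly cleaner check is to merge existing levels and to look at how observations spread over the factor2-by-response cells. The following is only a sketch on made-up data; the data frame toy, the state codes, and the grouping into four merged levels are all hypothetical.

```r
# Toy stand-in for new.data (hypothetical values, not the real lake data)
set.seed(1)
toy <- data.frame(
  response = factor(sample(1:10, 200, replace = TRUE)),
  factor2  = factor(sample(c("IA", "ME", "MI", "MN", "NH", "NY", "VT", "WI"),
                           200, replace = TRUE))
)

# Cross-tabulate to see how thinly observations are spread over the cells
cell.counts <- table(toy$factor2, toy$response)
sum(cell.counts == 0)  # number of empty factor2-by-response cells

# Merge the 8 states into 4 groups (this grouping is arbitrary, for illustration)
toy$factor2.merged <- toy$factor2
levels(toy$factor2.merged) <- list(
  group.a = c("IA", "MN"),
  group.b = c("MI", "WI"),
  group.c = c("ME", "NH"),
  group.d = c("NY", "VT")
)
nlevels(toy$factor2.merged)
```

Refitting with a merged factor of this kind might help show whether the failure depends on the number of levels itself or on how the levels relate to the response. That is an untested assumption on my part.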
sessionInfo(), as requested:
sessionInfo()
R version 3.0.3 (2014-03-06)
Platform: x86_64-apple-darwin10.8.0 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats4 grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] party_1.0-13 modeltools_0.2-21 strucchange_1.5-0 sandwich_2.3-0 zoo_1.7-11 RCurl_1.95-4.1
[7] bitops_1.0-6
loaded via a namespace (and not attached):
[1] coin_1.0-23 lattice_0.20-29 mvtnorm_0.9-99992 splines_3.0.3 survival_2.37-7 tools_3.0.3