r - How to combine (pool) RMSE values produced by regression algorithms on multiply imputed datasets

Question

I have a small dataset (280 rows) with missing values. I used multiple imputation (the mice package, m = 5) to impute my dataset.

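Then I applied different regression algorithms (e.g. SVM, rpart, etc.) to each imputed dataset using 10-fold cross-validation. I will use the resulting RMSE (root mean square error) values to compare the regression algorithms.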

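The problem is that, because the dataset has been imputed 5 times, I end up with 5 mean RMSE values for each algorithm. My question is: how do I combine the five RMSEs that belong to one algorithm, so that I can compare the algorithms with each other? In other words, I want to compute the average coefficient. I know that the pool() function can do that, but I am not sure whether I can use it with machine learning methods such as SVM and random forest.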
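To make the question concrete, here is a minimal sketch of the simplest combination I can think of: take the mean over the five imputations for each algorithm, and also look at how much the values vary between imputations. The RMSE numbers and the names `rmse_by_imp`, `pooled` and `between_sd` are invented for illustration only; they are not results from my data, and a plain average is an assumption about what a sensible pooling rule might be, not an established procedure.

```r
# Hypothetical per-imputation RMSEs (rows = imputed datasets 1..5,
# columns = algorithms). These numbers are made up for illustration.
rmse_by_imp <- cbind(
  rf    = c(2.10, 2.05, 2.20, 2.15, 2.00),
  svm   = c(2.40, 2.35, 2.50, 2.45, 2.30),
  rpart = c(2.90, 3.00, 2.80, 2.95, 2.85)
)

# Point estimate per algorithm: the mean over the m = 5 imputations.
pooled <- colMeans(rmse_by_imp)

# Between-imputation standard deviation: how much the imputation step
# alone moves the RMSE of each algorithm.
between_sd <- apply(rmse_by_imp, 2, sd)

print(round(rbind(mean = pooled, sd = between_sd), 3))
```

If the between-imputation spread is small compared with the differences between the algorithms' means, the ranking of the algorithms is probably not very sensitive to the imputation.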

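One solution I thought of is to stack all the imputed data frames in long format and then apply my algorithms, so that I end up with a single mean RMSE. However, I am worried about overfitting, because the long format may contain duplicate records. Please correct me if I am wrong.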
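The worry about duplicate records can be illustrated with a small sketch. In stacked data every original row appears five times, so a purely random split could put copies of the same row into both the training part and the test part. One possible workaround, shown here, is to assign folds by original row rather than by stacked row. The column names `.imp` and `.id` are meant to mimic a long-format layout, but the data frame is built by hand, and `long`, `fold_of_row` and `leaks` are hypothetical names used only for this illustration.

```r
set.seed(7)
n    <- 280   # number of original rows, as in the dataset described above
m    <- 5     # number of imputations
fold <- 10    # number of cross-validation folds

# Hand-built stand-in for the stacked (long) data:
# one block of n rows per imputation.
long <- data.frame(
  .imp = rep(seq_len(m), each = n),
  .id  = rep(seq_len(n), times = m)
)

# Assign each ORIGINAL row to a fold, then let all of its copies inherit it.
fold_of_row <- sample(rep_len(seq_len(fold), n))
long$fold   <- fold_of_row[long$.id]

# Check: for every fold, count original rows that occur in both the
# training part and the test part (should be zero everywhere).
leaks <- sapply(seq_len(fold), function(i) {
  length(intersect(long$.id[long$fold == i], long$.id[long$fold != i]))
})
print(leaks)
```

With folds assigned this way, all five copies of a row are held out together, so the test rows are never seen during training in another imputed version.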

Thank you very much, I hope you can help me.

Below is my code.

library(mice)
library(randomForest)

x <- data
form <- target ~ .   # model formula; "target" is the response column
fold <- 10           # number of folds for cross-validation
m <- 5               # number of imputations

imp <- mice(x, method = "pmm", m = m)  # imputation using mice pmm (5 imputations)

impSetsVector <- list()  # will hold the 5 imputed sets
for (i in seq(m)) {
  impSetsVector[[i]] <- complete(imp, action = i, include = FALSE)
}

## Next I applied randomForest using 10-fold cross-validation to each imputed set
## and computed the RMSE for each dataset

avg.rmse <- matrix(data = NA, nrow = m, ncol = 1)  # mean RMSE for each imputed dataset

for (j in seq(m)) {  # as we have 5 imputed datasets
  x <- impSetsVector[[j]]  # x holds the j-th imputed dataset
  n <- nrow(x)
  prop <- n %/% fold       # fold size; splits evenly when n is a multiple of fold (n = 280 here)
  set.seed(7)              # same seed, so every imputed dataset gets the same folds
  newseq <- rank(runif(n))
  k <- as.factor((newseq - 1) %/% prop + 1)  # fold label for each row
  y <- all.vars(form)[1]   # name of the response variable
  vec.error <- vector(length = fold)
  ## start modeling with 10-fold cross-validation
  for (i in seq(fold)) {
    # Perform the randomForest method on the training folds
    fit <- randomForest(form, data = x[k != i, ], ntree = 500, keep.forest = TRUE,
                        importance = TRUE, na.action = na.omit)

    fcast <- predict(fit, newdata = x[k == i, ])  # predict using the test fold
    rmse <- sqrt(mean((x[k == i, y] - fcast)^2))
    vec.error[i] <- rmse  # RMSE for the test fold
  }  # end of the inner loop

  avg.rmse[j] <- mean(vec.error)  ## the mean of the 10 RMSEs
}  # end of loop
