r - R: k-fold cross-validation for linear regression with the standard error of the estimate

Question

I would like to perform k-fold cross-validation for a linear regression model in R and test the one-standard-error rule:

https://stats.stackexchange.com/questions/17904/one-standard-error-rule-for-variable-selection

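For that, I need a function that returns the cross-validation estimate of the prediction error together with the standard error of that estimate (or at least the MSE for each fold, so that I can compute the standard error myself). Many packages have functions that compute the cross-validation error (for example, cv.glm in the boot package), but they usually return only the CV estimate of the prediction error, not its standard error or the per-fold MSE.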

I tried the DAAG package, whose CVlm function is supposed to give richer output than cv.glm. However, I can't seem to make it work! Here is my code:

a=c(0.0056, 0.0088, 0.0148, 0.0247, 0.0392, 0.0556, 0.0632, 0.0686, 0.0786, 0.0855, 0.0937)
b=c(6.0813, 9.5011, 15.5194, 23.9409, 32.8492, 40.8399, 43.8760, 45.5270, 46.7668, 46.1587, 43.4524)
dataset=data.frame(x=a,y=b)
CV.list=CVlm(df=dataset,form.lm = formula(y ~ poly(x,2)), m=5)

I get this rather uninformative error:

Error in xy.coords(x, y, xlabel, ylabel, log) : 
'x' and 'y' lengths differ

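This doesn't make much sense to me. x and y have the same length (11), so the function is evidently complaining about some other x and y variables that it creates internally.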

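I would gladly accept solutions that use other packages (e.g., caret). It would also be great if I could specify the number of repetitions of the k-fold cross-validation.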

score 4 · Accepted Answer

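CVlm doesn't like the poly(x,2) in your formula. You can easily avoid the problem by first adding the result of poly(x,2) to your data frame and then calling CVlm on these new variables: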

dataset2 <- cbind(dataset,poly(dataset$x,2))
names(dataset2)[3:4] <- c("p1","p2")
CV.list=CVlm(df=dataset2,form.lm = formula(y ~ p1+p2))
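As a sanity check on this workaround, the sketch below uses only base R (no DAAG) to compare a fit with poly(x,2) in the formula against a fit on the precomputed columns, both on the full data and on a training subset. The held-out row indices are an arbitrary choice made for illustration; they are not the folds CVlm would pick.

```r
a <- c(0.0056, 0.0088, 0.0148, 0.0247, 0.0392, 0.0556, 0.0632, 0.0686, 0.0786, 0.0855, 0.0937)
b <- c(6.0813, 9.5011, 15.5194, 23.9409, 32.8492, 40.8399, 43.8760, 45.5270, 46.7668, 46.1587, 43.4524)
dataset <- data.frame(x = a, y = b)

# Precompute the orthogonal polynomial columns, as in the answer
dataset2 <- cbind(dataset, poly(dataset$x, 2))
names(dataset2)[3:4] <- c("p1", "p2")

# Same model written two ways, fitted on all rows
fit_poly <- lm(y ~ poly(x, 2), data = dataset)
fit_cols <- lm(y ~ p1 + p2, data = dataset2)
all.equal(unname(fitted(fit_poly)), unname(fitted(fit_cols)))

# Same comparison when some rows are held out (arbitrary rows, for illustration)
test_rows <- c(2, 6, 10)
pred_poly <- predict(lm(y ~ poly(x, 2), data = dataset[-test_rows, ]),
                     newdata = dataset[test_rows, ])
pred_cols <- predict(lm(y ~ p1 + p2, data = dataset2[-test_rows, ]),
                     newdata = dataset2[test_rows, ])
all.equal(unname(pred_poly), unname(pred_cols))
```

Both comparisons should agree up to floating-point error, because p1 and p2 are linear combinations of 1, x and x^2, so either formula describes the same quadratic fit on any subset of rows.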

Since you are interested in the printed values, which unfortunately are not saved anywhere, you can use the following:

# captures the printed output
printOut <- capture.output(CV.list=CVlm(df=dataset2,form.lm = formula(y ~ p1+p2)))

# function to parse the output 
# to be adapted if necessary for your needs
GetValues <- function(itemName,printOut){
    line <- printOut[grep(itemName,printOut)]
    items <- unlist(strsplit(line,"[=]|  +"))
    itemsMat <- matrix(items,ncol=2,byrow=TRUE)
    vectVals <- as.numeric(itemsMat[grep(itemName,itemsMat[,1]),2])
    return(vectVals)
}

# get the Mean square values as a vector
MS <- GetValues("Mean square",printOut)
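If parsing printed output feels fragile, another route (not what this answer does) is to run the folds by hand in base R and keep the MSE of every fold. The sketch below is only that: cv_lm_mse is a hypothetical helper name, the fold assignment is a plain random split, and the standard error is taken as the standard deviation of the fold MSEs divided by sqrt(k), which is one common convention for the one-standard-error rule.

```r
# Hypothetical helper, not part of DAAG, boot or caret.
# Assumes the left-hand side of the formula is a plain column name.
cv_lm_mse <- function(formula, data, k = 5, repeats = 1, seed = 1) {
  set.seed(seed)                       # reproducible fold assignment
  n <- nrow(data)
  response <- all.vars(formula)[1]     # name of the response column
  runs <- lapply(seq_len(repeats), function(r) {
    # randomly assign each row to one of k folds of near-equal size
    fold <- sample(rep(seq_len(k), length.out = n))
    mse <- sapply(seq_len(k), function(i) {
      test <- fold == i
      fit  <- lm(formula, data = data[!test, , drop = FALSE])
      pred <- predict(fit, newdata = data[test, , drop = FALSE])
      mean((data[[response]][test] - pred)^2)
    })
    data.frame(repetition = r, fold = seq_len(k), mse = mse)
  })
  do.call(rbind, runs)
}

a <- c(0.0056, 0.0088, 0.0148, 0.0247, 0.0392, 0.0556, 0.0632, 0.0686, 0.0786, 0.0855, 0.0937)
b <- c(6.0813, 9.5011, 15.5194, 23.9409, 32.8492, 40.8399, 43.8760, 45.5270, 46.7668, 46.1587, 43.4524)
dataset <- data.frame(x = a, y = b)

# 5-fold CV repeated 3 times; poly() in the formula is fine for lm() and predict()
res <- cv_lm_mse(y ~ poly(x, 2), dataset, k = 5, repeats = 3)

# CV estimate and its standard error within each repetition
cv_mean <- tapply(res$mse, res$repetition, mean)
cv_se   <- tapply(res$mse, res$repetition, function(m) sd(m) / sqrt(length(m)))
```

With only 11 observations the folds have unequal sizes (3, 2, 2, 2, 2), so the mean of the fold MSEs is not exactly the pooled MSE over all observations. To compare candidate models fairly, call the helper with the same seed for each formula so that every model sees the same folds.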

score 1 · Accepted Answer

The overall MSE is stored as an attribute of the object returned by CVlm. attributes(CV.list)$ms gives you what you are looking for.
