r - Supplying new data to R's predict function

Question

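R's predict function can take a newdata argument, which is documented as follows:

newdata: An optional data frame in which to look for variables with which to predict. If omitted, the fitted values are used.

But I have found that this is not entirely true, depending on how the model was fitted. For example, the following code works as expected: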

x <- rnorm(200, sd=10)
y <- x + rnorm(200, sd=1)
data <- data.frame(x, y)
train <- sample(seq_along(x), size=length(x)/2, replace=FALSE)
dataTrain <- data[train,]
dataTest <- data[-train,]
m <- lm(y ~ x, data=dataTrain)
head(predict(m,type="response"))
head(predict(m,newdata=dataTest,type="response"))

However, if the model is fitted like this:

m2 <- lm(dataTrain$y ~ dataTrain$x)
head(predict(m2,type="response"))
head(predict(m2,newdata=dataTest,type="response"))

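The last two lines produce exactly the same results. predict ignores the newdata argument here, i.e. it does not actually compute predictions for the new data at all.

The culprit is of course lm(y ~ x, data=dataTrain) versus lm(dataTrain$y ~ dataTrain$x). But I have not found any documentation that mentions the difference between the two. Is this a known issue?

I am using R 2.15.2.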

score 16 · Accepted Answer

See ?predict.lm, in particular the Note section, which I quote below:

Note:

     Variables are first looked for in ‘newdata’ and then searched for
     in the usual way (which will include the environment of the
     formula used in the fit).  A warning will be given if the
     variables found are not of the same length as those in ‘newdata’
     if it was supplied.
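The warning mentioned in the Note can be seen by making the test set a different size from the training set. The following is a minimal sketch, not part of the original answer; the seed, the 100/50 split, and the msg variable are assumptions chosen only for illustration.

```r
set.seed(1)  # assumed seed, only for reproducibility
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
dataTrain <- data[1:100, ]
dataTest  <- data[101:150, ]  # 50 rows, deliberately not the training size

m2 <- lm(dataTrain$y ~ dataTrain$x)

# Capture the warning text instead of printing it
msg <- NULL
p <- withCallingHandlers(
  predict(m2, newdata = dataTest),
  warning = function(w) {
    msg <<- conditionMessage(w)
    invokeRestart("muffleWarning")
  }
)

length(p)  # number of training rows, not the number of rows in dataTest
msg        # the warning about the row mismatch
```

In the question, both halves have 100 rows, which is why no warning appeared there.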

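Although it does not explicitly describe the behaviour in terms of "same name" and so on, as far as the formula is concerned, the terms you passed in were of the form foo$var, and there are no variables with names like that either in newdata or along the search path that R will traverse to look for them.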
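One way to see the name mismatch is to compare the term labels stored in each fitted model with the column names of the new data. This is a small sketch that reuses the question's setup; the seed and the labels_m / labels_m2 names are assumptions added for illustration.

```r
set.seed(42)  # assumed seed, only for reproducibility
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
train <- sample(seq_len(nrow(data)), size = nrow(data) / 2)
dataTrain <- data[train, ]
dataTest  <- data[-train, ]

m  <- lm(y ~ x, data = dataTrain)
m2 <- lm(dataTrain$y ~ dataTrain$x)

# The predictor term each model will look for at prediction time
labels_m  <- attr(terms(m),  "term.labels")  # "x"
labels_m2 <- attr(terms(m2), "term.labels")  # "dataTrain$x"

labels_m  %in% names(dataTest)  # TRUE: newdata has a column called x
labels_m2 %in% names(dataTest)  # FALSE: nothing in newdata matches
```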

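In your second case, you are completely misusing the model formula notation; the idea is to describe the model succinctly and symbolically. Succinctness and repeating the data object ad nauseam are not compatible.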

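The behaviour you observed is exactly consistent with the documented behaviour. Put simply, you fitted the model with the terms dataTrain$x and dataTrain$y and then tried to predict for the terms x and y. As far as R is concerned, those are different names and hence different things, and it is right not to match them.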
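The difference can be checked numerically. The sketch below is an illustration rather than part of the original answer; the seed and the helper names same_as_fitted, by_hand, and uses_newdata are assumptions. It compares each model's predictions with what one would expect if newdata were ignored or used.

```r
set.seed(42)  # assumed seed, only for reproducibility
x <- rnorm(200, sd = 10)
y <- x + rnorm(200, sd = 1)
data <- data.frame(x, y)
train <- sample(seq_len(nrow(data)), size = nrow(data) / 2)
dataTrain <- data[train, ]
dataTest  <- data[-train, ]

m  <- lm(y ~ x, data = dataTrain)     # formula uses bare column names
m2 <- lm(dataTrain$y ~ dataTrain$x)   # formula hard-codes the training data

# m2: the "predictions" are just the training fitted values again
same_as_fitted <- isTRUE(all.equal(
  unname(predict(m2, newdata = dataTest)),
  unname(fitted(m2))
))

# m: predictions match intercept + slope * test x computed by hand
by_hand <- coef(m)[["(Intercept)"]] + coef(m)[["x"]] * dataTest$x
uses_newdata <- isTRUE(all.equal(
  unname(predict(m, newdata = dataTest)),
  by_hand
))

same_as_fitted  # TRUE
uses_newdata    # TRUE
```

Writing the formula with bare variable names and passing the data frame through the data argument is what allows predict to find columns with the same names in newdata.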
