r - 使用 R 的自然三次样条回归

Question

我似乎对splines::ns()R 中的函数有疑问。

我创建了一个简单的虚拟问题

dat <- data.frame(t <- seq(0, 6, .01),
                  x <- rnorm(length(t), sd = 1),
                  y <- 5 + t - x^2 + rnorm(length(t), sd = .33))

lm(y ~ t + I(x^2), data = dat)

library(splines)
lm(y ~ t + ns(x, knots = c(0), Boundary.knots = c(-3, 3)), data = dat)

虽然第一个模型工作正常，但第二个模型无法正确识别截距。我在这里想念什么？

score 5 · Accepted Answer

没有错，因为您没有拟合完全相同的模型，它们甚至不等价。

为了解释您看到的不同结果，使用带有单个协变量的更简单示例就足够了x。我们从二次多项式生成数据：5 + x + x^2，然后拟合多个模型。

set.seed(0)
x <- rnorm(500, mean = 1)  ## `x` with non-zero mean
y <- 5 + x + x * x + rnorm(500, sd = 0.5)
library(splines)

fit1 <- lm(y ~ x + I(x^2))
#(Intercept)            x       I(x^2)  
#      4.992        1.032        0.980  

fit2 <- lm(y ~ poly(x, degree = 2))
#(Intercept)  poly(x, degree = 2)1  poly(x, degree = 2)2  
#      7.961                70.198                28.720

fit3 <- lm(y ~ bs(x, degree = 2, df = 2))
#(Intercept)  bs(x, degree = 2, df = 2)1   bs(x, degree = 2, df = 2)2  
#      6.583                      -8.337                       20.650  

fit4 <- lm(y ~ ns(x, df = 2))
#(Intercept)  ns(x, df = 2)1  ns(x, df = 2)2  
#      5.523          10.737          21.265

前 3 个模型在参数化方面并不相同，但它们是等价的：它们都拟合具有 3 个自由度的二次多项式。要查看它们的等价性，我们检查它们的拟合值：

sum(abs(fit1$fitted - fit2$fitted))
# [1] 1.54543e-13

sum(abs(fit1$fitted - fit3$fitted))
# [1] 2.691181e-13

要查看参数化的差异，我们查看设计矩阵：

X1 <- model.matrix(~ x + I(x^2))
X2 <- model.matrix(~ poly(x, degree = 2))
X3 <- model.matrix(~ bs(x, degree = 2, df = 2))

par(mfrow = c(3,3), oma = rep.int(1,4), mar = c(4, 4, 0, 0))

plot(x, X1[, 1], cex = 0.2)
plot(x, X1[, 2], cex = 0.2)
plot(x, X1[, 3], cex = 0.2)

plot(x, X2[, 1], cex = 0.2)
plot(x, X2[, 2], cex = 0.2)
plot(x, X2[, 3], cex = 0.2)

plot(x, X3[, 1], cex = 0.2)
plot(x, X3[, 2], cex = 0.2)
plot(x, X3[, 3], cex = 0.2)

由于设计矩阵不同（无论是形状还是比例），您最终不会得到相同的系数集。如果您感到惊讶，让我们尝试一个更简单的示例：

x1 <- x - mean(x)
test <- lm(y ~ x1 + I(x1^2))
#(Intercept)           x1      I(x1^2)  
#      7.003        2.991        0.980 

sum(abs(fit1$fitted - test$fitted))
# [1] 1.24345e-13

在这里，我只是对进行了一些简单的变换x，那么结果是不同的（但仍然是等价的）。

第 4 个模型fit4是拟合 3 个自由度的三次多项式，因此它并不等同于所有以前的模型。我们可以检查拟合值：

sum(abs(fit1$fitted - fit4$fitted))
# [1] 39.36563

score 1 · Accepted Answer

完全忽略 ns() 你会错过两件事：

1）上面的注释解释了如何定义数据框：

t <- seq(0, 6, .01)
x <- rnorm(length(t), sd = 1)
y <- 5 + t - x^2 + rnorm(length(t), sd = .33)
df <- data.frame(t, x, y)       
rm(t, x, y)

2）您调用模型的方式：

lm(y ~ t + I(t^2), data=df)
lm(y ~ splines::ns(t, knots = c(0), Boundary.knots = c(-3, 3)), data=df)

第一个模型没有正确识别您认为它的作用。

r - 使用 R 的自然三次样条回归

2 回答 2

Related

Reference