na - How to handle repeated values in GAM regression (mgcv)?

Question

I am trying to explore the regression between abundance and 3 variables. My data (test.gam) looks like this:

# A tibble: 6 x 5
  Site   Abundance      SPM isotherm SiOH4
  <chr>      <dbl>    <dbl>    <dbl> <dbl>
1 cycle1     0.769 5960367.    102.   18.2
2 cycle1     0.632 6496360.     97.5  18.2
3 cycle1     0.983 5328652.    105    18.2
4 cycle1     1     6212034.    110    18.2
5 cycle1     0.821 5468987.    105    18.2
6 cycle1     0.734 5280549.    112.   18.2

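For one of the variables (SiOH4) I have only one value per site, whereas for the other 2 variables I have one value per station (each row is a station). SiOH4 is therefore repeated across all rows of a site.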

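To plot the relationship between abundance and SiOH4, I simply computed the mean abundance for each site. The relationship shows abundance increasing steadily as SiOH4 increases: Plot1.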
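
A minimal sketch of that per-site averaging, assuming a data frame shaped like test.gam. The `toy` data frame and its values are made up for illustration and are not the real data:

```r
# Toy data shaped like test.gam (hypothetical values, not the real data)
toy <- data.frame(
  Site      = rep(c("cycle1", "cycle2", "cycle3"), each = 3),
  Abundance = c(0.7, 0.6, 0.8, 0.4, 0.5, 0.3, 0.9, 1.0, 0.8),
  SiOH4     = rep(c(18.2, 9.5, 25.1), each = 3)
)

# One row per site: the mean abundance, plus the site's SiOH4 value
# (SiOH4 is constant within a site, so its mean is just that value)
site_means <- aggregate(cbind(Abundance, SiOH4) ~ Site, data = toy, FUN = mean)
```

Calling `plot(Abundance ~ SiOH4, data = site_means)` on the result then gives one point per site.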

Now I am trying to fit a GAM to these data with the following code:

library(mgcv)  # provides gam() and gam.check()
mod_gam1 <- gam(Abundance ~ s(isotherm, bs = "cr", k = 5) + SPM + s(SiOH4, bs = "cr", k = 5),
                data = test.gam, family = gaussian(link = "log"), gamma = 1.4)

which gives me these results:

Family: gaussian 
Link function: log 

Formula:
Abundance ~ s(isotherm, bs = "cr", k = 5) + SPM + s(SiOH4, bs = "cr", 
    k = 5)

Parametric coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept) -8.182e-01  8.244e-02  -9.925  < 2e-16 ***
SPM         -4.356e-08  1.153e-08  -3.778 0.000219 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Approximate significance of smooth terms:
              edf Ref.df      F  p-value    
s(isotherm) 2.019  2.485 10.407 1.46e-05 ***
s(SiOH4)    3.861  3.986  9.823 1.01e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

R-sq.(adj) =  0.492   Deviance explained = 51.2%
GCV = 0.044202  Scale est. = 0.040674  n = 177

So I am happy with the results, but on running gam.check I found that k is too low.

Method: GCV   Optimizer: outer newton
full convergence after 8 iterations.
Gradient range [-8.801477e-14,5.555545e-13]
(score 0.04420205 & scale 0.04067442).
Hessian positive definite, eigenvalue range [6.631202e-05,7.084933e-05].
Model rank =  10 / 10 

Basis dimension (k) checking results. Low p-value (k-index<1) may
indicate that k is too low, especially if edf is close to k'.

              k'  edf k-index p-value    
s(isotherm) 4.00 2.02    0.85    0.01 ** 
s(SiOH4)    4.00 3.86    0.59  <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Note that I deliberately set k to 5, because otherwise the model overfits the pattern.
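
As a side check on the choice of k: a smooth with basis dimension k needs at least k distinct covariate values, so with one SiOH4 value per site the number of sites caps the k that s(SiOH4) can use. A rough sketch of that check follows; the `toy` data frame, `k_requested`, and the values are hypothetical:

```r
# Toy data: 4 sites, each with a single SiOH4 value repeated over its rows
toy <- data.frame(
  Site  = rep(c("cycle1", "cycle2", "cycle3", "cycle4"), each = 3),
  SiOH4 = rep(c(18.2, 9.5, 25.1, 14.0), each = 3)
)

k_requested <- 5                             # basis dimension asked for in s(SiOH4, k = 5)
n_distinct  <- length(unique(toy$SiOH4))     # distinct covariate values available
k_feasible  <- n_distinct >= k_requested     # FALSE here: only 4 distinct values
```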

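I think this may be because many of the SiOH4 values are duplicated. I modified my data to keep only the first value for each site (replacing it with NA in all other rows), like this: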

# A tibble: 6 x 5
# Groups:   Site [1]
  Site   Abundance      SPM isotherm SiOH4
  <chr>      <dbl>    <dbl>    <dbl> <dbl>
1 cycle1     0.769 5960367.    102.   18.2
2 cycle1     0.632 6496360.     97.5  NA  
3 cycle1     0.983 5328652.    105    NA  
4 cycle1     1     6212034.    110    NA  
5 cycle1     0.821 5468987.    105    NA  
6 cycle1     0.734 5280549.    112.   NA
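
For reference, a sketch of how such masking could be done in base R, together with a count of the rows that survive once rows containing NA are dropped. The `toy` data frame and its values are made up for illustration:

```r
# Toy data: 2 sites with 3 stations each (hypothetical values)
toy <- data.frame(
  Site     = rep(c("cycle1", "cycle2"), each = 3),
  isotherm = c(102, 97.5, 105, 88, 91, 93),
  SiOH4    = rep(c(18.2, 9.5), each = 3)
)

masked <- toy
# duplicated() is FALSE for the first row of each Site and TRUE for later rows
masked$SiOH4[duplicated(masked$Site)] <- NA

# Rows left after dropping any row with an NA, as na.omit would do
n_used <- sum(complete.cases(masked))
```

Only one row per site remains usable, which is why most of the data is lost with this approach.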

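I was hoping this would prevent that level of repetition. But this way I also lose most of my rows, since the na.omit option is in effect. And running the same GAM, I get the same kind of warning from gam.check.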

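So what is the best approach? Keep the duplicated values and ignore the gam.check warning, or is there a way to keep all the rows somehow even with the NAs?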

Thanks!!!
