r - 为什么 GLM 的 Hosmer Lemeshow 不适合？

Question

首先，我不是统计学家，所以如果有一些我看不到的非常明显的东西，我深表歉意。如果您能指出我正确的方向，我将不胜感激！

我想计算我的多元逻辑模型的拟合度——它正在查看与某些环境变量（相位、风向和海况 = 分类、云 = 连续）和调查持续时间相关的二进制存在/不存在数据 (Sp3)：

> Model.pres <- glm(Sp3~Phase+Duration+WindDir+Seastate+Clouds, data=Pres_suit, family=binomial)
> summary(Model.pres)

Call:
glm(formula = Sp3 ~ Phase + Duration + WindDir + Seastate + Clouds, 
    family = binomial, data = Pres_suit)

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-0.8884  -0.3752  -0.2772  -0.1987   3.0564  

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.5372280  0.3762506  -6.743 1.55e-11 ***
Phase2       0.3922746  0.2375475   1.651 0.098667 .  
Phase3       0.3558603  0.2560635   1.390 0.164609    
Phase4       0.0805965  0.2381698   0.338 0.735062    
Phase5       0.5069505  0.2144266   2.364 0.018068 *  
Phase6       0.4925329  0.2212965   2.226 0.026036 *  
Phase7       0.2931561  0.2525538   1.161 0.245737    
Phase8      -0.3385487  0.2750551  -1.231 0.218383    
Duration     0.0311791  0.0089617   3.479 0.000503 ***
WindDirN    -0.4908414  0.2650803  -1.852 0.064073 .  
WindDirNE   -0.6806600  0.3229646  -2.108 0.035071 *  
WindDirNW   -0.7477190  0.3452913  -2.165 0.030351 *  
WindDirS    -0.1167498  0.2871772  -0.407 0.684344    
WindDirSE   -0.3359101  0.2194478  -1.531 0.125842    
WindDirSW   -0.3228489  0.2366893  -1.364 0.172561    
WindDirVA   -0.0589474  0.1961462  -0.301 0.763774    
WindDirW    -0.5573541  0.2216143  -2.515 0.011904 *  
Seastate1   -0.3681850  0.2218246  -1.660 0.096954 .  
Seastate2   -1.2598255  0.2363087  -5.331 9.75e-08 ***
Seastate3   -2.1172848  0.3083132  -6.867 6.54e-12 ***
Clouds      -0.0009123  0.0018059  -0.505 0.613430    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2434.6  on 5783  degrees of freedom
Residual deviance: 2256.7  on 5763  degrees of freedom
AIC: 2298.7

Number of Fisher Scoring iterations: 6

> anoGLM<-anova(Model.pres, test="Chisq")
> anoGLM
Analysis of Deviance Table

Model: binomial, link: logit

Response: Sp3

Terms added sequentially (first to last)


         Df Deviance Resid. Df Resid. Dev  Pr(>Chi)    
NULL                      5783     2434.6              
Phase     7   19.877      5776     2414.7 0.0058417 ** 
Duration  1   14.558      5775     2400.1 0.0001359 ***
WindDir   8   39.892      5767     2360.2 3.355e-06 ***
Seastate  3  103.248      5764     2257.0 < 2.2e-16 ***
Clouds    1    0.255      5763     2256.7 0.6135572    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

为了表明模型与数据的拟合程度，我决定检查伪 r2 (Nagelkerke) 和 Hosmer-Lemeshow（因为调整后的 r2 不是正确的选择）：

> #Hosmer-Lemeshow Goodness of Fit (GOF) Test
> hoslem.test(Pres_suit$Sp3, fitted(Model.pres))

    Hosmer and Lemeshow goodness of fit (GOF) test

data:  Pres_suit$Sp3, fitted(Model.pres)
X-squared = 24.492, df = 8, p-value = 0.001894

> #Pseudo r2 
> library(fmsb)
> NagelkerkeR2(Model.pres)
$N
[1] 5784

$R2
[1] 0.08812984

通过低伪 r2 和显着的 HosLem 检验，我使用DHARMa package

> # simulated quantile residuals
> sim.pres <- simulateResiduals(Model.pres)
> sim.pres
Object of Class DHARMa with simulated residuals based on 250 simulations with refit = FALSE . See ?DHARMa::simulateResiduals for help. 
 
Scaled residual values: 0.6273667 0.3875756 0.3128554 0.5701987 0.3334376 0.3622152 0.674742 0.8650855 0.7833462 0.523428 0.05092012 0.1867278 0.005426142 0.7711206 0.5303359 0.8879234 0.2274915 0.7091264 0.2288981 0.7708964 ...
> plot(sim.pres)


[![plot(sim.pres)][1]][1]


> # Dispersion test
> testDispersion(sim.pres) # tests under and overdispersion

    DHARMa nonparametric dispersion test via sd of residuals fitted vs. simulated

data:  simulationOutput
ratioObsSim = 0.99815, p-value = 0.968
alternative hypothesis: two.sided

> # Outlier test
> testOutliers(sim.pres, type = "binomial") 

    DHARMa outlier test based on exact binomial test with approximate expectations

data:  sim.pres
outliers at both margin(s) = 40, observations = 5784, p-value = 0.4155
alternative hypothesis: true probability of success is not equal to 0.007968127
95 percent confidence interval:
 0.004945089 0.009405336
sample estimates:
frequency of outliers (expected: 0.00796812749003984 ) 
                                           0.006915629 


[![Dispersion and Outlier test plots][1]][1]

> # testing zero inflation
> testZeroInflation(sim.pres) # tests if there are more zeros than expected

    DHARMa zero-inflation test via comparison to expected zeros with simulation under H0 =
    fitted model

data:  simulationOutput
ratioObsSim = 1, p-value = 0.968
alternative hypothesis: two.sided

[![Zero inflation plot][1]][1]

我只是找不到重要的 HL 测试的原因，有人有什么想法吗？我也在检查其他物种，并且有一些甚至更多的零，但它们没有产生显着的 HL（尽管 Nagelkerke r2 甚至更低）。还有什么我需要检查的吗？

r - 为什么 GLM 的 Hosmer Lemeshow 不适合？

0 回答 0

Related

Reference