r - R geepack：使用 GEE 进行不合理的大估计

Question

我正在使用geepackR 来估计逻辑边际模型geeglm()。但我得到了垃圾估计。它们大约大了 16 个数量级。但是 p 值似乎与我的预期相似。这意味着响应本质上变成了阶跃函数。见附图

这是生成绘图的代码：

require(geepack)
data = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=data, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=data)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)

这是回归表：

Call:
geeglm(formula = moden ~ 1 + power, family = binomial, data = data, 
    id = defacto, corstr = "exchangeable")

 Coefficients:
             Estimate   Std.err  Wald Pr(>|W|)    
(Intercept) -7.38e+15  1.47e+15  25.1  5.4e-07 ***
power        2.05e+13  1.60e+12 164.4  < 2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Estimated Scale Parameters:
            Estimate  Std.err
(Intercept) 1.03e+15 1.65e+37

Correlation: Structure = exchangeable  Link = identity 

Estimated Correlation Parameters:
      Estimate  Std.err
alpha    0.196 3.15e+21
Number of clusters:   3   Maximum cluster size: 381

希望得到一些帮助。谢谢！

亲切的问候，

马吕斯

score 7 · Accepted Answer

我将给出三个程序，每个程序都是边缘化随机截距模型（MRIM）。这些 MRIM 的系数具有边际逻辑解释，并且比 GEE 的量级小：

| Model | (Intercept) |  power |  LogL  |
|-------|-------------|--------|--------| 
| `L_N` |       -1.050| 0.00267|  -270.1|
| `LLB` |       -0.668| 0.00343|  -273.8|
| `LPN` |       -1.178| 0.00569|  -266.4|

与不考虑任何相关性的 glm 相比，供参考：

| Model | (Intercept) |  power |  LogL  |
|-------|-------------|--------|--------| 
|  strt |       -0.207| 0.00216|  -317.1|

边缘化随机截距模型 (MRIM) 值得探索，因为您需要一个具有可交换相关结构的边缘模型来用于聚类数据，这就是 MRIM 所展示的结构类型。

文献的代码（尤其是带有注释的 R 脚本）和 PDF 位于GITHUB 存储库中。我在下面详细介绍了代码和文献。

MRIM 的概念自 1999 年就已经存在，有关此的一些背景资料在GITHUB repo中。我建议先阅读Swihart et al 2014，因为它回顾了其他论文。

按年代顺序 -

L_N Heagerty (1999)：该方法适合具有正态分布随机截距的随机截距逻辑模型。诀窍是随机截距模型中的预测器是用边际系数非线性参数化的，因此得到的边际模型具有边际逻辑解释。它的代码是lnMLER 包（不在 CRAN 上，而是在 Patrick Heagerty 的网站上）。这种方法L_N在代码中表示为在边缘表示 logit (L)，在条件尺度 (_) 上没有解释和正态 (N) 分布的随机截距。
LLB Wang & Louis (2003)：该方法适合带有桥分布随机截距的随机截距逻辑模型。与 Heagerty 1999 中的技巧是随机截距模型的非线性预测器不同，技巧是一种特殊的随机效应分布（桥分布），它允许随机截距模型和生成的边际模型都具有逻辑解释。它的代码是用gnlmix4MMM.R（在 repo 中）实现的，它使用rmutil和repeatedR 包。这种方法LLB在代码中表示为边际上的 logit (L)，条件尺度上的 logit (L) 和桥 (B) 分布式截距。
LPN Caffo 和 Griswold (2006)：该方法适合具有正态分布随机截距的随机截距概率模型，而 Heagerty 1999 使用logit随机截距模型。这种替换使计算更容易，并且仍然产生边际 logit 模型。它的代码是用gnlmix4MMM.R（在 repo 中）实现的，它使用rmutil和repeatedR 包。这种方法LPN在代码中表示为在边际上表示 logit (L)，在条件尺度上表示概率 (P) 和正态 (N) 分布截距。
Griswold et al (2013)：另一个评论/实用介绍。
Swihart et al 2014：这是 Heagerty 1999 和 Wang & Louis 2003 以及其他人的评论论文，并概括了 MRIM 方法。最有趣的概括之一是允许边际模型和条件模型中的逻辑 CDF（等效地，logit 链接）改为近似逻辑 CDF 的稳定分布。它的代码是用gnlmix4MMM.R（在 repo 中）实现的，它使用rmutil和repeatedR 包。我SSS在R 脚本中用注释表示这一点，以指示边缘上的稳定 (S)、条件尺度上的稳定 (S) 和稳定 (S) 分布式截距。它包含在 R 脚本中，但在这篇关于 SO 的帖子中没有详细说明。

准备

#code from OP Question: edit `data` to `d` 
require(geepack)
d = read.csv(url("http://folk.uio.no/mariujon/data.csv"))
fit = geeglm(moden ~ 1 + power, id = defacto, data=d, corstr = "exchangeable", family=binomial)
summary(fit)
plot(moden ~ power, data=d)
x = 0:2500
y = predict(fit, newdata=data.frame(power = x), type="response" )
lines(x,y)
#get some starting values from glm():
strt <- coef(glm(moden ~  power, family = binomial, data=d))
strt
#I'm so sorry but these methods use attach()
attach(d)

`L_N`海格蒂 (1999)

# marginally specifies a logit link and has a nonlinear conditional model
# the following code will not run if lnMLE is not successfully installed.  
# See https://faculty.washington.edu/heagerty/Software/LDA/MLV/
library(lnMLE)
L_N <- logit.normal.mle(meanmodel = moden ~ power,
                        logSigma= ~1,
                        id=defacto,
                        model="marginal",
                        data=d,
                        beta=strt,
                        r=10) 
print.logit.normal.mle(L_N)

准备`LLB`和`LPN`

library("gnlm")
library("repeated")
source("gnlmix4MMM.R") ## see ?gnlmix; in GITHUB repo 
y <- cbind(d$moden,(1-d$moden))

`LLB`王和路易斯 (2003)

LLB  <- gnlmix4MMM(y = y,
                   distribution = "binomial",
                   mixture = "normal",
                   random = "rand",
                   nest = defacto,
                   mu = ~ 1/(1+exp(-(a0 + a1*power)*sqrt(1+3/pi/pi*exp(pmix)) - sqrt(1+3/pi/pi*exp(pmix))*log(sin(pi*pnorm(rand/sqrt(exp(pmix)))/sqrt(1+3/pi/pi*exp(pmix)))/sin(pi*(1-pnorm(rand/sqrt(exp(pmix))))/sqrt(1+3/pi/pi*exp(pmix)))))),
                   pmu = c(strt, log(1)),
                   pmix = log(1))

print("code: 1 -best 2-ok 3,4,5 - problem")
LLB$code
print("coefficients")
LLB$coeff
print("se")
LLB$se

`LPN`卡福和格里斯沃尔德 (2006)

LPN  <- gnlmix4MMM(y = y,
                   distribution = "binomial",
                   mixture = "normal",
                   random = "rand",
                   nest = defacto,
                   mu = ~pnorm(qnorm(1/(1+exp(-a0 - a1*power)))*sqrt(1+exp(pmix)) + rand),
                   pmu = c(strt, log(1)),
                   pmix = log(1))

print("code: 1 -best 2-ok 3,4,5 - problem")
LPN$code
print("coefficients")
LPN$coeff
print("se")
LPN$se

来自 3 种方法的系数：

rbind("L_N"=L_N$beta, "LLB" = LLB$coefficients[1:2], "LPN"=LPN$coefficients[1:2])

3 个模型的最大对数似然：

rbind("L_N"=L_N$logL, "LLB" = -LLB$maxlike, "LPN"=-LPN$maxlike)

r - R geepack：使用 GEE 进行不合理的大估计

1 回答 1

准备

L_N海格蒂 (1999)

准备LLB和LPN

LLB王和路易斯 (2003)

LPN卡福和格里斯沃尔德 (2006)