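The mi package seems to have undergone a fairly substantial rewrite at some point in the past few years.
The following tutorial gives a good overview of the "old" way of doing things: http://thomasleeper.com/Rcourse/Tutorials/mi.html
The "new" way of doing things (sticking to Leeper's simulated demonstration) looks something like this: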
#load mi
library(mi)
#set seed
set.seed(10)
#simulate some data (with some observations missing)
x1 <- runif(100, 0, 5)
x2 <- rnorm(100)
y <- 2*x1 + 20*x2 + rnorm(100)
mydf <- cbind.data.frame(x1, x2, y)
mydf$x1[sample(1:nrow(mydf), 20, FALSE)] <- NA
mydf$x2[sample(1:nrow(mydf), 10, FALSE)] <- NA
# Convert to a missing_data.frame
mydf_mdf <- missing_data.frame(mydf)
# impute
mydf_imp <- mi(mydf_mdf)
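Although the function names have changed, this is actually quite similar to the "old" way of doing things.
The biggest change (from my perspective) is the replacement of the following "old" functions: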
lm.mi(formula, mi.object, ...)
glm.mi(formula, mi.object, family = gaussian, ...)
bayesglm.mi(formula, mi.object, family = gaussian, ...)
polr.mi(formula, mi.object, ...)
bayespolr.mi(formula, mi.object, ...)
lmer.mi(formula, mi.object, rescale=FALSE, ...)
glmer.mi(formula, mi.object, family = gaussian, rescale=FALSE, ...).
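Previously, a user could fit a model to each imputed data set with one of these functions and then pool the results with mi.pooled() (or coef.mi(), if we follow the Leeper example).
In the current version of mi (I have v1.0 installed), these last steps appear to have been combined into a single function, pool(). The pool() function appears to read the family and link function that were assigned to the variable during the imputation process above, and then estimates a model with bayesglm using the specified formula, as shown below.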
# run models on imputed data and pool the results
summary(pool(y ~ x1 + x2, mydf_imp))
## 
## Call:
## pool(formula = y ~ x1 + x2, data = mydf_imp)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.98754  -0.40923   0.03393   0.46734   2.13848  
## 
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -0.34711    0.25979  -1.336    0.215    
## x1           2.07806    0.08738  23.783 1.46e-13 ***
## x2          19.90544    0.11068 179.844  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.7896688)
## 
##     Null deviance: 38594.916  on 99  degrees of freedom
## Residual deviance:    76.598  on 97  degrees of freedom
## AIC: 264.74
## 
## Number of Fisher Scoring iterations: 7
It looks like we come close to recovering our simulated beta values (2 and 20). In other words, it behaves as expected.
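For reference, the pooling step itself can be sketched by hand. The block below is a minimal base R illustration of Rubin's rules, under stated assumptions: `pool_rubin()` and the `completed` list are hypothetical names of mine, the data sets are freshly simulated stand-ins rather than real imputations, and I have not checked that pool() combines estimates in exactly this way. In practice the completed data sets would presumably be extracted from the mi object (for example with mi's complete(), which I have not verified here).

```r
# Rubin's rules for a list of fitted models, one per completed data set.
# pool_rubin() is a hypothetical helper, not part of the mi package.
pool_rubin <- function(fits) {
  m    <- length(fits)
  est  <- sapply(fits, coef)                        # p x m matrix of estimates
  wvar <- sapply(fits, function(f) diag(vcov(f)))   # p x m within-imputation variances
  qbar <- rowMeans(est)                             # pooled point estimates
  ubar <- rowMeans(wvar)                            # average within-imputation variance
  b    <- apply(est, 1, var)                        # between-imputation variance
  total <- ubar + (1 + 1 / m) * b                   # total variance
  data.frame(estimate = qbar, std.error = sqrt(total))
}

set.seed(1)
# Stand-ins for m = 5 completed data sets (assumption: simulated, not imputed)
completed <- lapply(1:5, function(i) {
  x1 <- runif(100, 0, 5)
  x2 <- rnorm(100)
  data.frame(x1 = x1, x2 = x2, y = 2 * x1 + 20 * x2 + rnorm(100))
})

# Fit the same model to every completed data set, then pool
fits   <- lapply(completed, function(d) lm(y ~ x1 + x2, data = d))
pooled <- pool_rubin(fits)
print(pooled)
```

Because the helper only relies on coef() and vcov(), the same pattern would in principle apply to any model class that provides both methods.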
To get a grouping variable, let's use a slightly larger data set with a naively simulated random effect.
mydf2 <- data.frame(x1 = rep(runif(100, 0, 5), 20)
                   ,x2 = rep(rnorm(100, 0, 2.5), 20)
                   ,group_var = rep(1:20, each = 100)
                   ,noise = rep(rnorm(100), 20))
mydf2$y <- 2*mydf2$x1 + 20*mydf2$x2 + mydf2$noise
mydf2$x1[sample(1:nrow(mydf2), 200, FALSE)] <- NA
mydf2$x2[sample(1:nrow(mydf2), 100, FALSE)] <- NA
# Convert to a missing_data.frame
mydf2_mdf <- missing_data.frame(mydf2)
show(mydf2_mdf)
## Object of class missing_data.frame with 2000 observations on 5 variables
## 
## There are 4 missing data patterns
## 
## Append '@patterns' to this missing_data.frame to access the corresponding pattern for every observation or perhaps use table()
## 
##                 type missing method  model
## x1        continuous     200    ppd linear
## x2        continuous     100    ppd linear
## group_var continuous       0   <NA>   <NA>
## noise     continuous       0   <NA>   <NA>
## y         continuous       0   <NA>   <NA>
## 
##             family     link transformation
## x1        gaussian identity    standardize
## x2        gaussian identity    standardize
## group_var     <NA>     <NA>    standardize
## noise         <NA>     <NA>    standardize
## y             <NA>     <NA>    standardize
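Since missing_data.frame() appears to interpret group_var as continuous, I use the change() function from mi to reassign its type to "un" (for "unordered-categorical") and then proceed as above.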
mydf2_mdf <- change(mydf2_mdf, y = "group_var", what = "type", to = "un"  )
# impute
mydf2_imp <- mi(mydf2_mdf)
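Now, unless version 1.0 of mi has removed the functionality of previous versions (i.e., the functionality available with lmer.mi and glmer.mi), I would assume that adding a random effect to the formula should point pool() to the appropriate lme4 function. However, the initial error message suggests that this is not the case.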
# run models on imputed data and pool the results
summary(pool(y ~ x1 + x2 + (1|group_var), mydf2_imp))
## Warning in Ops.factor(1, group_var): '|' not meaningful for factors
## Warning in Ops.factor(1, group_var): '|' not meaningful for factors
## Error in if (prior.scale[j] < min.prior.scale) {: missing value where TRUE/FALSE needed
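Following the warning message and extracting the integers from my factor does get me an estimate, but the results suggest that pool() is still fitting a fixed-effects model with bayesglm and treating my attempted random effect as a constant.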
summary(pool(y ~ x1 + x2 + (1|as.numeric(as.character(group_var))), mydf2_imp))
## 
## Call:
## pool(formula = y ~ x1 + x2 + (1 | as.numeric(as.character(group_var))), 
##     data = mydf2_imp)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.93633  -0.69923   0.01073   0.56752   2.12167  
## 
## Coefficients:
##                                               Estimate Std. Error  t value
## (Intercept)                                  1.383e-01  2.596e+02    0.001
## x1                                           1.995e+00  1.463e-02  136.288
## x2                                           2.000e+01  8.004e-03 2499.077
## 1 | as.numeric(as.character(group_var))TRUE -3.105e-08  2.596e+02    0.000
##                                             Pr(>|t|)    
## (Intercept)                                        1    
## x1                                            <2e-16 ***
## x2                                            <2e-16 ***
## 1 | as.numeric(as.character(group_var))TRUE        1    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for gaussian family taken to be 0.8586836)
## 
##     Null deviance: 5384205.2  on 1999  degrees of freedom
## Residual deviance:    1713.9  on 1996  degrees of freedom
## AIC: 5377
## 
## Number of Fisher Scoring iterations: 4
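My reading of the output above, sketched in base R only: without lme4's formula handling, `(1 | group_var)` seems to be evaluated as an ordinary logical OR. On a factor that triggers the Ops.factor warning and yields NA, and on a numeric vector it yields TRUE for every row, which would explain the constant `...TRUE` column. The toy variables below (`g_num`, `g_fac`, `d`) are hypothetical, and I have not inspected pool()'s internals, so treat this as a plausible explanation rather than a confirmed one.

```r
g_num <- rep(1:3, each = 4)   # toy grouping variable stored as numbers
g_fac <- factor(g_num)        # the same grouping stored as a factor

# On a numeric vector, `|` is element-wise logical OR, and 1 counts as TRUE
or_num <- 1 | g_num

# On a factor, Ops.factor warns that '|' is not meaningful and returns NA
or_fac <- suppressWarnings(1 | g_fac)

set.seed(2)
d <- data.frame(x1 = rnorm(12), g_num = g_num)
d$y <- 2 * d$x1 + rnorm(12)

# A non-lme4 model matrix treats `1 | g_num` as a single logical predictor
mm <- model.matrix(y ~ x1 + (1 | g_num), data = d)
print(colnames(mm))
print(unique(mm[, ncol(mm)]))
```

If this reading is right, the last column duplicates the intercept, which would be consistent with the huge standard errors on both terms in the pooled summary.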
My questions are:
- Is it possible to easily generate pooled random-effects estimates using mi, and
- If so, how?