r - Plotting logistic regression results from binomial data with a mixed-effects model (lme4) and model averaging (MuMIn)

Question

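I am trying to display the results of a logistic regression. My model is fitted with glmer() from the lme4 package, and I then use MuMIn for model averaging.

A simplified version of my model, using the mtcars dataset: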

glmer(vs ~ wt +  am + (1|carb), database, family = binomial, na.action = "na.fail")

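The output I want is two plots showing the predicted probability of vs = 1: one for wt, which is continuous, and one for am, which is binary.

After a comment from @KamilBartoń, I got this far: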

library(lme4)
library(MuMIn)

database <- mtcars

# Scale data
database$wt <- scale(mtcars$wt)
database$am <- scale(mtcars$am)

# Make global model
model.1 <- glmer(vs ~ wt + am + (1|carb), database, family = binomial, na.action = "na.fail")

# Model selection
model.1.set <- dredge(model.1, rank = "AICc")

# Get models with <10 delta AICc
top.models.1 <- get.models(model.1.set,subset = delta<10)

# Model averaging
model.1.avg <- model.avg(top.models.1)

# make dataframe with all values set to their mean
xweight <- as.data.frame(lapply(lapply(database[, -1], mean), rep, 100))

# add new sequence of wt to xweight along range of data
xweight$wt <- (wt = seq(min(database$wt), max(database$wt), length = 100))

# predict new values
yweight <- predict(model.1.avg, newdata = xweight, type="response", re.form=NA)

# Make plot 
plot(database$wt, database$vs, pch = 20, xlab = "WEIGHT (g)", ylab = "VS")

# Add predicted line
lines(xweight$wt, yweight)

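Which produces:

The remaining problem is that the data are scaled and centered on 0, which makes the plot impossible to interpret. I can unscale the data using @BenBolker's answer to this question, but then the plot does not display correctly: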

## Ben Bolker's unscale function:
## scale variable x using center/scale attributes of variable y
scfun <- function(x,y) {
  scale(x,
        center=attr(y,"scaled:center"),
        scale=attr(y,"scaled:scale"))
        }

## scale prediction frame with scale values of original data -- for all variables
xweight_sc <- transform(xweight,
                        wt = scfun(wt, database$wt),
                        am = scfun(am, database$am))

# predict new values
yweight <- predict(model.1.avg, newdata = xweight_sc, type="response", re.form=NA)

# Make plot 
plot(mtcars$wt, mtcars$vs, pch = 20, xlab = "WEIGHT (g)", ylab = "VS")

# Add predicted line
lines(xweight$wt, yweight)

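Which produces:

I have tried several different approaches but cannot work out what the problem is. What am I doing wrong?

Also, one more remaining question: how do I make the plot for the binary variable am?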

score 2 · Accepted Answer

Setup

library(lme4)
library(MuMIn)
database <- mtcars
database$wt <- scale(mtcars$wt)
database$am <- factor(mtcars$am) ## <== note the difference here. It is a factor not numeric
model.1 <- glmer(vs ~ wt + am + (1|carb), database, family = binomial, na.action = "na.fail")
model.1.set <- dredge(model.1, rank = "AICc")
top.models.1 <- get.models(model.1.set,subset = delta<10)
model.1.avg <- model.avg(top.models.1)
nPoints <- 100
wt_pred_data <- data.frame(wt = seq(min(database$wt), max(database$wt), length = nPoints),
                           am = database$am[which.min(database$am)], #Base level for the factor
                           var = 'wt')
am_pred_data <- data.frame(wt = mean(database$wt), 
                           am = unique(database$am),
                           var = 'am')
pred_data <- rbind(wt_pred_data, am_pred_data)
rm(wt_pred_data, am_pred_data)
pred_data$vs <- predict(model.1.avg, newdata = pred_data, re.form = NA, type = 'response')

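The actual answer

This adds to my previous answer, since Thomas seemed interested in how to handle factors and how to obtain confidence intervals using the bootstrap.

Handling factors

First, handling factors is not much harder than handling numeric variables. The difference is that

when plotting the effect of a numeric variable, factors should be set to their base level (e.g. for am as a factor this is the first level, "0", whose internal integer code is 1)
when plotting a factor, all numeric variables are set to their mean and all other factors are set to their base level.

One way to get the base level of a factor is factor[which.min(factor)], and another is factor(levels(factor)[1], levels = levels(factor)). The ggeffects package uses a method similar to this.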
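As a small illustration of the two ways of extracting a base level described above, here is a minimal base R sketch. The name am_f is hypothetical and introduced only for this example, and as.integer() is used to make the lookup on the internal codes explicit.

```r
# Illustrative only: am_f is a hypothetical name, not taken from the answer.
am_f <- factor(mtcars$am)                 # levels are "0" and "1"

# Option 1: pick the observation with the smallest internal integer code
base_1 <- am_f[which.min(as.integer(am_f))]

# Option 2: build a length-one factor from the first level (R indexes from 1)
base_2 <- factor(levels(am_f)[1], levels = levels(am_f))

levels(am_f)                              # the full set of levels
as.character(base_1)                      # label of the base level
as.integer(base_1)                        # internal code of the base level
identical(as.character(base_1), as.character(base_2))
```

Both options keep the full set of levels on the result, which matters because predict() expects the factor in newdata to have the same levels as the factor used for fitting.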

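Bootstrapping

In practice, bootstrapping ranges from easy to hard. One can use parametric, semi-parametric or non-parametric bootstraps.
The non-parametric bootstrap is the easiest to explain. One simply draws a sample from the original dataset (say 2/3, 3/4 or 4/5 of it; a smaller fraction can be used for "good", larger datasets), refits the model using that sample, and then predicts from the new model. The process is repeated N times, and the resulting predictions are used to estimate a standard deviation or quantiles, which in turn give the confidence intervals. There seems to be no method implemented in MuMIn that handles this for us, so it seems we have to handle it ourselves.
Code like this usually gets quite messy, and wrapping it in functions makes it cleaner. To my frustration, however, MuMIn seemed to have problems with that, so below is an approach that does not use functions. In this code I chose a sample fraction of 4/5, since the dataset is rather small.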

###                            ###
## Non-parametric bootstrapping ##
## Note: Gibberish with         ##
##       singular fit!          ##
###                            ###

# 1) Create sub-sample from the dataset (eg 2/3, 3/4 or 4/5 of the original dataset)
# 2) refit the model using the new dataset and estimate model average using this dataset
# 3) estimate the predicted values using the refitted model
# 4) refit the model N times

nBoot <- 100
frac <- 4/5 #fraction of the rows drawn for each sample. Better datasets can use less.
bootStraps <- vector('list', nBoot)
shutup <- function(x) #Useful helper function for making a function shut up
  suppressMessages(suppressWarnings(force(x)))
ii <- seq_len(n <- nrow(database))
nn <- ceiling(frac * n)
nb <- nn * nBoot
samples <- sample(ii, nb, TRUE)
samples <- split(samples, (nn + seq_len(nb) - 1) %/% nn) #See unique((nn + seq_len(nb) - 1) %/% nn) # <= Gives 1 - 100.
#Not run:
# lengths(samples) # <== all of them are 26 long! ceiling(frac * n) = 26!
# Run the bootstraps
for(i in seq_len(nBoot)){
  preds <- try({
    # 1) Sample 
    d <- database[samples[[i]], ]
    # 2) fit the model using the sample
    bootFit <- shutup(glmer(vs ~ wt + am + (1|carb), d, family = binomial, na.action = "na.fail"))
    bootAvg <- shutup(model.avg(get.models(dredge(bootFit, rank = 'AICc'), subset = delta < 10)))
    # 3) predict the data using the new model
    shutup(predict(bootAvg, newdata = pred_data, re.form = NA, type = 'response'))
  }, silent = TRUE)
  #save the predictions for later
  if(!inherits(preds, 'try-error'))
    bootStraps[[i]] <- preds
  # repeat N times
}
# Number of failed bootStraps:
sum(failed <- sapply(bootStraps, is.null)) #For me 44, but will be different for different datasets, samples and seeds.
bootStraps <- bootStraps[which(!failed)]
alpha <- 0.05
# 4) use the predictions for calculating bootstrapped intervals
quantiles <- apply(do.call(rbind, bootStraps), 2, quantile, probs = c(alpha / 2, 0.5, 1 - alpha / 2))
pred_data[, c('lower', 'median', 'upper')] <-  t(quantiles)
pred_data[, 'type'] <- 'non-parametric'

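Note that this is of course complete gibberish. The fit is singular, because mtcars is not a dataset that exhibits mixed effects, so the bootstrapped confidence intervals will be completely off (the range of values is far too wide). Note also that for a dataset this unstable, quite a lot of the bootstrap replicates fail to converge to anything sensible.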
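To see the mechanics of the resample, refit, predict and quantile steps without the singular-fit problem, here is a minimal sketch that swaps in a plain glm() with wt as the only predictor. This is not the model from the question. The names dat, grid, n_boot, n_draw, boot_preds and ci are hypothetical, and the seed and number of replicates are arbitrary choices.

```r
# A simplified stand-in: plain glm(), no random effect, no model averaging.
set.seed(1)                                  # arbitrary seed, for reproducibility only
dat  <- mtcars
grid <- data.frame(wt = seq(min(dat$wt), max(dat$wt), length.out = 25))

n_boot <- 200                                # number of bootstrap replicates (arbitrary)
n_draw <- ceiling(0.8 * nrow(dat))           # the same 4/5 fraction as in the code above

boot_preds <- replicate(n_boot, {
  # 1) resample rows with replacement
  d <- dat[sample(nrow(dat), n_draw, replace = TRUE), ]
  # 2) refit the model on the resample
  fit <- suppressWarnings(glm(vs ~ wt, data = d, family = binomial))
  # 3) predict on the fixed grid
  suppressWarnings(predict(fit, newdata = grid, type = "response"))
})                                           # matrix with nrow(grid) rows and n_boot columns

# 4) percentile intervals, one row per grid point
ci <- t(apply(boot_preds, 1, quantile, probs = c(0.025, 0.5, 0.975)))
colnames(ci) <- c("lower", "median", "upper")
head(cbind(grid, ci))
```

Because every replicate predicts on the same grid, the quantiles can be taken row by row, which is the same idea as the apply() over do.call(rbind, bootStraps) above.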

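For the parametric bootstrap we can turn to lme4::bootMer. This function takes a single merMod model (the result of glmer or lmer) together with a function to be evaluated on each parametric refit. Once that function is written, bootMer handles the rest. We are interested in the predicted values, so the function should return those. Note how similar the function is to the approach above.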

###                     ###
## Parametric bootstraps ##
## Note: Singular fit    ##
##       makes this      ##
##       useless!        ##
###                     ###
bootFun <- function(model){
  preds <- try({
    bootAvg <- shutup(model.avg(get.models(dredge(model, rank = 'AICc'), subset = delta < 10)))
    shutup(predict(bootAvg, newdata = pred_data, re.form = NA, type = 'response'))
  }, silent = FALSE)
  if(!inherits(preds, 'try-error'))
    return(preds)
  return(rep(NA_real_, nrow(pred_data)))
}
boots <- bootMer(model.1, FUN = bootFun, nsim = 100, re.form = NA, type = 'parametric')
quantiles <- apply(boots$t, 2, quantile, probs = c(alpha / 2, 0.5, 1 - alpha / 2), na.rm = TRUE)
# Create data to be predicted with parametric bootstraps
pred_data_p <- pred_data
pred_data_p[, c('lower', 'median', 'upper')] <- t(quantiles)
pred_data_p[, 'type'] <- 'parametric'
pred_data <- rbind(pred_data, pred_data_p)
rm(pred_data_p)

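Note again that, because of the singularity, the results will be gibberish. In this case the results will be far too certain, because the singularity means the model is too accurate on the known data. In practice this makes the width of every interval 0, or close enough to 0 that it makes no difference.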
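The idea behind a parametric bootstrap (simulate new responses from the fitted model, refit, predict) can also be sketched in base R. The sketch below again uses a plain glm() with wt as the only predictor rather than the glmer model, and calls stats::simulate() instead of bootMer. The names fit, grid, sims, par_preds and par_ci are hypothetical.

```r
# A simplified stand-in for the parametric bootstrap, using glm() and simulate().
set.seed(2)                                  # arbitrary seed, for reproducibility only
fit  <- glm(vs ~ wt, data = mtcars, family = binomial)
grid <- data.frame(wt = seq(min(mtcars$wt), max(mtcars$wt), length.out = 25))

# One simulated 0/1 response vector per column, drawn from the fitted model
sims <- simulate(fit, nsim = 200)

par_preds <- sapply(sims, function(y_sim) {
  d <- mtcars
  d$vs <- y_sim                              # replace the response with a simulated one
  refit <- suppressWarnings(glm(vs ~ wt, data = d, family = binomial))
  suppressWarnings(predict(refit, newdata = grid, type = "response"))
})                                           # matrix with nrow(grid) rows and 200 columns

par_ci <- t(apply(par_preds, 1, quantile, probs = c(0.025, 0.5, 0.975)))
colnames(par_ci) <- c("lower", "median", "upper")
head(cbind(grid, par_ci))
```

The only difference from the non-parametric version is where the new data come from: here the predictors stay fixed and only the response is redrawn from the fitted model.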

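Finally, we just need to plot the results. We can use facet_wrap to compare the parametric and non-parametric results. Note again that, for this particular dataset, comparing completely useless confidence intervals is itself rather pointless.

Note that for the factor am I use geom_point and geom_errorbar, whereas for the numeric variable I use geom_line and geom_ribbon. This better represents the grouped nature of a factor compared with the continuous nature of a numeric variable.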


#Finally we can plot our result:
# wt
library(ggplot2)
ggplot(pred_data[pred_data$var == 'wt', ], aes(y = vs, x = wt)) + 
  geom_line() + 
  geom_ribbon(aes(ymax = upper, ymin = lower), alpha = 0.2) + 
  facet_wrap(. ~ type) + 
  ggtitle('gibberish numeric plot (caused by singularity in fit)')

# am
ggplot(pred_data[pred_data$var == 'am', ], aes(y = vs, x = am)) + 
  geom_point() + 
  geom_errorbar(aes(ymax = upper, ymin = lower)) + 
  facet_wrap(. ~ type) + 
  ggtitle('gibberish factor plot (caused by singularity in fit)')

score 1 · Accepted Answer

Setup

library(lme4)
library(MuMIn)
database <- mtcars
database$wt <- scale(mtcars$wt)
database$am <- scale(mtcars$am)
model.1 <- glmer(vs ~ wt + am + (1|carb), database, family = binomial, na.action = "na.fail")
model.1.set <- dredge(model.1, rank = "AICc")
top.models.1 <- get.models(model.1.set,subset = delta<10)
model.1.avg <- model.avg(top.models.1)

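Answer

The problem at hand seems to be creating an average-effects plot similar to those from the effects package (or the ggeffects package). Thomas was very close, but a small misunderstanding of Ben Bolker's answer led to the scaling process being reversed, which in this case means the variables are scaled twice. This can be seen in the following excerpt from the code above.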

database$wt <- scale(mtcars$wt)
database$am <- scale(mtcars$am)

# More code

xweight <- as.data.frame(lapply(lapply(database[, -1], mean), rep, 100))
xweight$wt <- (wt = seq(min(database$wt), max(database$wt), length = 100))

# more code 

scfun <- function(x,y) {
  scale(x,
        center=attr(y,"scaled:center"),
        scale=attr(y,"scaled:scale"))
        }
xweight_sc <- transform(xweight,
                        wt = scfun(wt, database$wt),
                        am = scfun(am, database$am))

From this we can see that xweight has in fact already been scaled, so with the second scaling we effectively get

sc <- attr(database$wt, 'scaled:scale')
ce <- attr(database$wt, 'scaled:center')
xweight_sc$wt <- scale(scale(seq(min(mtcars$wt), max(mtcars$wt), length = 100), ce, sc), ce, sc)

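What Ben Bolker discusses in his answer, however, is the situation where the model uses scaled predictors but the data used for prediction are not scaled. Here the data are already correctly scaled, but one wants to interpret them on the original scale. We simply need to reverse the process. Two methods can be used for this.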
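The effect of scaling twice can be checked numerically. The following is a minimal base R sketch. The names wt_scaled, grid_scaled and grid_double are hypothetical and used only for this illustration.

```r
# Illustrative only: shows how a second scaling moves the grid off the model's scale.
wt_scaled <- scale(mtcars$wt)
ce <- attr(wt_scaled, "scaled:center")       # mean of the raw wt
sc <- attr(wt_scaled, "scaled:scale")        # standard deviation of the raw wt

# Prediction grid built on the already-scaled variable, as in the question
grid_scaled <- seq(min(wt_scaled), max(wt_scaled), length.out = 100)

# Applying the same centering and scaling a second time
grid_double <- (grid_scaled - ce) / sc

range(grid_scaled)                           # the range the model was fitted on
range(grid_double)                           # shifted and rescaled away from that range
```

Predicting on grid_double asks the model about values it mostly never saw during fitting, which is why the second plot in the question looks wrong.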

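Method 1: Change the breaks in ggplot

Note: in base R, custom labels can be used via xlab.

One way to change the axis is... to change the axis. This lets one keep the data as they are and only rescale the labels.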

# Extract scales
sc <- attr(database$wt, 'scaled:scale')
ce <- attr(database$wt, 'scaled:center')
# Create plotting and predict data
n <- 100
pred_data <- aggregate(. ~ 1, data = mtcars, FUN = mean)[rep(1, n), ]
pred_data$wt <- seq(min(database$wt), max(database$wt), length = n)
# am was scaled before fitting, so hold it at the mean of the scaled variable
pred_data$am <- mean(database$am)
pred_data$vs <- predict(model.1.avg, newdata = pred_data, type = 'response', re.form = NA)
# Create breaks
library(scales) #for pretty_breaks and label_number
breaks <- pretty_breaks()(pred_data$wt, 4) #4 is arbitrary
# Unscale the breaks to be used as labels
labels <- label_number()(breaks * sc + ce) #See method 2 for explanation
# Finally we plot the result
library(ggplot2)
ggplot(data = pred_data, aes(x = wt, y = vs)) + 
  geom_line() + 
  geom_point(data = database) + 
  scale_x_continuous(breaks = breaks, labels = labels) #to change labels.

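This gives the desired result. Note that there are no confidence bands. This is because the confidence intervals of the model have no closed form, and it seems that the best way to get any estimate of them is the bootstrap.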
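For base R graphics, one possible counterpart of relabelling the breaks is to suppress the default x axis and draw it with axis(). This is a minimal sketch that plots only the raw points, not the model predictions. The names wt_scaled, breaks and labels are hypothetical, and pretty() stands in for pretty_breaks().

```r
# Illustrative only: scaled data on the x axis, tick labels in original units.
wt_scaled <- as.numeric(scale(mtcars$wt))    # scaled predictor, as used in the model
ce <- mean(mtcars$wt)                        # center used by scale()
sc <- sd(mtcars$wt)                          # scale used by scale()

breaks <- pretty(wt_scaled, n = 4)           # tick positions on the scaled axis
labels <- round(breaks * sc + ce, 2)         # the same positions in original units

plot(wt_scaled, mtcars$vs, pch = 20, xaxt = "n",
     xlab = "wt (original units)", ylab = "vs")
axis(side = 1, at = breaks, labels = labels) # draw the relabelled x axis
```

A predicted line computed on the scaled grid can then be added with lines() as in the question, since the underlying x coordinates are still on the scaled scale.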

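Method 2: Unscaling

In unscaling we simply reverse the process of scale. Since scale(x) = (x - mean(x))/sd(x), we only need to isolate x: x = scale(x) * sd(x) + mean(x). That is the transformation to apply, while still remembering to use the scaled data during prediction: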

# unscale the variables 
pred_data$wt <- pred_data$wt * sc + ce
database$wt <- database$wt * sc + ce

# Finally plot the result
ggplot(data = pred_data, aes(x = wt, y = vs)) + 
         geom_line() + 
         geom_point(data = database)

This gives the desired result.
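The unscaling formula can be verified with a quick round trip in base R. The helper unscale() below is hypothetical. It is not part of base R or of any package used here, and it assumes its second argument carries the attributes set by scale().

```r
# unscale() is a hypothetical helper written for this illustration only.
unscale <- function(z, ref) {
  # Reverse scale(): multiply by the stored sd, then add the stored mean
  as.numeric(z) * attr(ref, "scaled:scale") + attr(ref, "scaled:center")
}

wt_scaled <- scale(mtcars$wt)                # scaled copy, keeps center and scale attributes
wt_back   <- unscale(wt_scaled, wt_scaled)   # back on the original scale

all.equal(wt_back, mtcars$wt)                # checks the round trip
```

The same helper can be applied to a prediction grid built on the scaled variable, as long as the reference object is the scaled column that was used to fit the model.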

score 0 · Accepted Answer

You can use the ggeffects package for this, with either ggpredict() or ggeffect() (see ?ggpredict for the difference between the two functions: the former calls predict(), the latter effects::Effect()).

library(ggeffects)
library(sjmisc)
library(lme4)
data(mtcars)

mtcars <- std(mtcars, wt)
mtcars$am <- as.factor(mtcars$am)

m <- glmer(vs ~ wt_z + am + (1|carb), mtcars, family = binomial, na.action = "na.fail")

# Note the use of the "all"-tag here, see help for details
ggpredict(m, "wt_z [all]") %>% plot()

ggpredict(m, "am") %>% plot()
