r - 使用 rstan 对贝叶斯模型进行线性模型诊断

Question

我正在寻找一种有效的方法来识别对线性模型的参数有巨大影响的数据点。这对于普通的线性模型来说是直截了当的，但我不确定如何使用贝叶斯线性模型来做到这一点。

这是使用普通线性模型的一种方法，我们可以计算每个数据点的库克距离，并绘制包含库克距离的诊断图：

# ordinary linear model diagnostics, similar to my use-case
library(dplyr)
library(purrr)
library(tidyr)
library(broom)
# make linear models for each type of species
xy <- 
  iris %>% 
  nest(-Species) %>% 
  mutate(model = map(data, 
                     ~lm(Sepal.Length ~ Petal.Width, 
                         data = .)),
         fit = map(model, augment))

这里我们有一个带有嵌套列表的数据框，该model列包含每个物种的线性模型：

> xy
# A tibble: 3 × 4
     Species              data    model                   fit
      <fctr>            <list>   <list>                <list>
1     setosa <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
2 versicolor <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
3  virginica <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>

该broom::augment函数允许我们将每个数据点的 Cook 距离值添加到该数据帧中，我们可以像这样检查它们：

# inspect Cook's distance values
xy %>% 
 unnest(fit) %>% 
 arrange(desc(.cooksd))

  # A tibble: 150 × 10
      Species Sepal.Length Petal.Width  .fitted    .se.fit     .resid       .hat    .sigma    .cooksd
       <fctr>        <dbl>       <dbl>    <dbl>      <dbl>      <dbl>      <dbl>     <dbl>      <dbl>
1  versicolor          5.9         1.8 6.612097 0.16181001 -0.7120969 0.13725081 0.4269862 0.24507448
2      setosa          5.0         0.6 5.335281 0.17114108 -0.3352811 0.25027563 0.3410686 0.21385214
3   virginica          4.9         1.7 6.375829 0.13613717 -1.4758292 0.04875277 0.5826838 0.15434787
4      setosa          5.7         0.4 5.149247 0.08625887  0.5507534 0.06357957 0.3355980 0.09396588
5      setosa          4.3         0.1 4.870195 0.08321347 -0.5701948 0.05916942 0.3349111 0.09285408
6   virginica          5.8         2.4 6.831411 0.14828703 -1.0314106 0.05784319 0.6035012 0.09117693
7   virginica          7.2         1.6 6.310746 0.16207266  0.8892538 0.06909799 0.6084108 0.08293253
8  versicolor          4.9         1.0 5.471005 0.11998077 -0.5710051 0.07546185 0.4328038 0.07544526
9      setosa          5.8         0.2 4.963212 0.05287342  0.8367879 0.02388828 0.3228858 0.07500610
10 versicolor          6.0         1.0 5.471005 0.11998077  0.5289949 0.07546185 0.4340307 0.06475225
# ... with 140 more rows, and 1 more variables: .std.resid <dbl>

使用该autoplot方法，我们可以制作显示 Cook 距离值的信息诊断图，并帮助我们快速识别对模型参数有巨大影响的数据点：

# plot model diagnostics
library(ggplot2)
library(ggfortify)
diagnostic_plots_df <- 
  xy %>%  
  mutate(diagnostic_plots = map(model, 
                                ~autoplot(., 
                                          which = 1:6, 
                                          ncol = 3, 
                                          label.size = 3)))

这只是制作的地块之一：

> diagnostic_plots_df[[1]]

现在，使用贝叶斯线性模型，我们可以类似地计算数据框中每个组的线性模型：

# Bayesian linear model diagnostics
library(rstanarm)
bayes_xy <- 
  iris %>% 
  nest(-Species) %>% 
  mutate(model = map(data, 
                     ~stan_lm(Sepal.Length ~ Petal.Width, 
                         data = .,
                         prior = NULL, 
                         chains = 1, 
                         cores = 2, 
                         seed = 1)),
         fit =  map(model, augment))

> bayes_xy
# A tibble: 3 × 4
     Species              data         model                   fit
      <fctr>            <list>        <list>                <list>
1     setosa <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
2 versicolor <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
3  virginica <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>

但是该broom::augment方法没有库克的距离值：

# inspect fit diagnostics
bayes_xy %>% unnest(fit)

# A tibble: 150 × 6
   Species Sepal.Length Petal.Width  .fitted    .se.fit      .resid
    <fctr>        <dbl>       <dbl>    <dbl>      <dbl>       <dbl>
1   setosa          5.1         0.2 4.963968 0.06020298  0.13482025
2   setosa          4.9         0.2 4.963968 0.06020298 -0.06517975
3   setosa          4.7         0.2 4.963968 0.06020298 -0.26517975
4   setosa          4.6         0.2 4.963968 0.06020298 -0.36517975
5   setosa          5.0         0.2 4.963968 0.06020298  0.03482025
6   setosa          5.4         0.4 5.151501 0.11299956  0.21818386
7   setosa          4.6         0.3 5.057734 0.05951488 -0.47349794
8   setosa          5.0         0.2 4.963968 0.06020298  0.03482025
9   setosa          4.4         0.2 4.963968 0.06020298 -0.56517975
10  setosa          4.9         0.1 4.870201 0.11408783  0.04313845
# ... with 140 more rows

而且没有autoplot方法：

# plot model diagnostics
bayes_diagnostic_plots_df <- 
  bayes_xy %>%  
  mutate(diagnostic_plots = map(model, 
                                ~autoplot(., 
                                          which = 1:6, 
                                          ncol = 3, 
                                          label.size = 3)))

# Error, there doesn't seem to be an autoplot method for stan_lm output

shinystan::launch_shinystan(bayes_xy$model[[1]])

# This is quite interesting, but nothing at the level of specific data points

一些学术文献谈到了模型扰动、phi-divergence、Cook 后验模态距离和 Cook 后验平均距离、Kullback -Leibler 散度等方法。但是我看不到任何用 R 代码探索过的地方，我被困住了。

在 Cross-validated 上有一个关于这个主题的未回答的问题。我在这里发帖是因为我正在寻找关于编写代码来计算影响统计的想法（而不是关于统计理论和方法等的建议，应该针对其他问题）

我怎样才能从输出中得到类似库克距离测量值的东西rstanarm::stan_lm？

score 4 · Accepted Answer

Aki Vehtari 的这篇文章说得最好：

lppd_i 和 loo_i 之间的差异已被用作敏感性测量（例如，参见 Gelfand 等人 1992）。如果 lppd_i 和 loo_i 之间的差异很大，则帕累托形状参数估计 k 可能很大。我还不清楚帕累托形状参数估计 k 是否会比 lppd_i-loo_i 更好，但至少我们知道如果 k 接近 1 或更大，lppd_i-loo_i 的估计太小，所以最好看看k。在使用普通模型的堆栈示例中，一次观察的 k 较大，但对于 student-t 模型，k 较小。Normal 模型与 student-t 模型相同，但在自由度上具有非常强的先验性。因此，这不仅仅是关于具有强大的先验或更多收缩，而是拥有一个可以很好地描述观察结果的模型。随着收缩率的增加和非鲁棒的观察模型，一个观察结果仍然可能令人惊讶。自然，更改为允许“异常值”的更强大的观察模型并不总是最好的解决方案。相反，最好使回归函数更非线性（即具有较弱的先验），或变换协变量，或添加更多协变量。所以我确实建议查看 Pareto 形状参数值，但如果值很大，我不建议增加收缩率。或变换协变量，或添加更多协变量。所以我确实建议查看 Pareto 形状参数值，但如果值很大，我不建议增加收缩率。或变换协变量，或添加更多协变量。所以我确实建议查看 Pareto 形状参数值，但如果值很大，我不建议增加收缩率。

$pareto_k您可以从 loo 包中的函数生成的列表元素中获取帕累托形状参数估计 k，该列表loo由 rstanarm 包重新导出。如果此值高于 0.7（默认情况下），该loo函数将建议您重新拟合模型而忽略此观察，因为后验分布可能对该观察过于敏感，无法满足 LOOIC 的假设，即每个观察对后验分布的影响可以忽略不计。

在 OP 的情况下，只有第七个观测值的帕累托形状参数估计值略大于 0.5，因此观测值可能对后验没有极端影响。但您肯定想调查值大于 1.0 的观测值

您还可以为plotloo 对象调用该方法，尤其是使用非默认选项label_points = TRUE来可视化 Pareto 形状参数估计。

score 1 · Accepted Answer

查看有关stan-users电子邮件列表的一些讨论，我看到loo包的输出包含“每个观察的逐点贡献”。因此，这是与这些合作的尝试：

# Bayesian linear model diagnostics
library(rstanarm)
library(loo)
bayes_xy <- 
  iris %>% 
  nest(-Species) %>% 
  mutate(model = map(data, 
                     ~stan_lm(Sepal.Length ~ Petal.Width, 
                         data = .,
                         prior = NULL, 
                         chains = 1, 
                         cores = 2, 
                         seed = 1)))


bayes_xy_loo <- 
bayes_xy %>% 
  mutate(loo_out = map(model, ~loo(.)))

library(ggplot2)
library(ggrepel)
n <-  5 # how many points to label


my_plots <- 
bayes_xy_loo %>% 
  select(loo_out) %>% 
  mutate(loo_pointwise = map(.$loo_out, ~data.frame(.$pointwise))) %>% 
  mutate(plots = map(.$loo_pointwise, 
      ~ggplot(., 
       aes(elpd_loo,
           looic)) +
       geom_point(aes(size = p_loo)) +
       geom_text_repel(data = .[tail(order(.$looic), n),] ,
                  aes(label = row.names(.[tail(order(.$looic), n),])),
                  nudge_x = -0.1,
                  nudge_y = -0.3) +
        geom_label_repel(data = .[tail(order(.$elpd_loo), n),] ,
                        aes(label = row.names(.[tail(order(.$elpd_loo), n),])),
                        nudge_x = 0.1,
                        nudge_y = 0.1) +
       xlab("Expected log pointwise \npredictive density") +
       ylab("LOO information \ncriterion") +
       scale_size_area(name = "Estimated \neffective\nnumber \nof parameters") +
       theme_minimal(base_size = 10)))

do.call(gridExtra::grid.arrange, my_plots$plots)

然而，被建议有影响的点并不是一个很好的匹配。例如，在问题中我们有 obs。7、15 和 30 具有高库克距离值。在loo输出中，似乎只有obs。15 像往常一样被识别。所以也许这不是正确的方法。

r - 使用 rstan 对贝叶斯模型进行线性模型诊断

2 回答 2

Related

Reference