我正在寻找一种有效的方法来识别对线性模型的参数有巨大影响的数据点。这对于普通的线性模型来说是直截了当的,但我不确定如何使用贝叶斯线性模型来做到这一点。
这是使用普通线性模型的一种方法,我们可以计算每个数据点的库克距离,并绘制包含库克距离的诊断图:
# ordinary linear model diagnostics, similar to my use-case
library(dplyr)
library(purrr)
library(tidyr)
library(broom)
# make linear models for each type of species
xy <-
iris %>%
nest(-Species) %>%
mutate(model = map(data,
~lm(Sepal.Length ~ Petal.Width,
data = .)),
fit = map(model, augment))
这里我们有一个带有嵌套列表的数据框,该model
列包含每个物种的线性模型:
> xy
# A tibble: 3 × 4
Species data model fit
<fctr> <list> <list> <list>
1 setosa <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
2 versicolor <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
3 virginica <tibble [50 × 4]> <S3: lm> <data.frame [50 × 9]>
该broom::augment
函数允许我们将每个数据点的 Cook 距离值添加到该数据帧中,我们可以像这样检查它们:
# inspect Cook's distance values
xy %>%
unnest(fit) %>%
arrange(desc(.cooksd))
# A tibble: 150 × 10
Species Sepal.Length Petal.Width .fitted .se.fit .resid .hat .sigma .cooksd
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 versicolor 5.9 1.8 6.612097 0.16181001 -0.7120969 0.13725081 0.4269862 0.24507448
2 setosa 5.0 0.6 5.335281 0.17114108 -0.3352811 0.25027563 0.3410686 0.21385214
3 virginica 4.9 1.7 6.375829 0.13613717 -1.4758292 0.04875277 0.5826838 0.15434787
4 setosa 5.7 0.4 5.149247 0.08625887 0.5507534 0.06357957 0.3355980 0.09396588
5 setosa 4.3 0.1 4.870195 0.08321347 -0.5701948 0.05916942 0.3349111 0.09285408
6 virginica 5.8 2.4 6.831411 0.14828703 -1.0314106 0.05784319 0.6035012 0.09117693
7 virginica 7.2 1.6 6.310746 0.16207266 0.8892538 0.06909799 0.6084108 0.08293253
8 versicolor 4.9 1.0 5.471005 0.11998077 -0.5710051 0.07546185 0.4328038 0.07544526
9 setosa 5.8 0.2 4.963212 0.05287342 0.8367879 0.02388828 0.3228858 0.07500610
10 versicolor 6.0 1.0 5.471005 0.11998077 0.5289949 0.07546185 0.4340307 0.06475225
# ... with 140 more rows, and 1 more variables: .std.resid <dbl>
使用该autoplot
方法,我们可以制作显示 Cook 距离值的信息诊断图,并帮助我们快速识别对模型参数有巨大影响的数据点:
# plot model diagnostics
library(ggplot2)
library(ggfortify)
diagnostic_plots_df <-
xy %>%
mutate(diagnostic_plots = map(model,
~autoplot(.,
which = 1:6,
ncol = 3,
label.size = 3)))
这只是制作的地块之一:
> diagnostic_plots_df[[1]]
现在,使用贝叶斯线性模型,我们可以类似地计算数据框中每个组的线性模型:
# Bayesian linear model diagnostics
library(rstanarm)
bayes_xy <-
iris %>%
nest(-Species) %>%
mutate(model = map(data,
~stan_lm(Sepal.Length ~ Petal.Width,
data = .,
prior = NULL,
chains = 1,
cores = 2,
seed = 1)),
fit = map(model, augment))
> bayes_xy
# A tibble: 3 × 4
Species data model fit
<fctr> <list> <list> <list>
1 setosa <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
2 versicolor <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
3 virginica <tibble [50 × 4]> <S3: stanreg> <data.frame [50 × 5]>
但是该broom::augment
方法没有库克的距离值:
# inspect fit diagnostics
bayes_xy %>% unnest(fit)
# A tibble: 150 × 6
Species Sepal.Length Petal.Width .fitted .se.fit .resid
<fctr> <dbl> <dbl> <dbl> <dbl> <dbl>
1 setosa 5.1 0.2 4.963968 0.06020298 0.13482025
2 setosa 4.9 0.2 4.963968 0.06020298 -0.06517975
3 setosa 4.7 0.2 4.963968 0.06020298 -0.26517975
4 setosa 4.6 0.2 4.963968 0.06020298 -0.36517975
5 setosa 5.0 0.2 4.963968 0.06020298 0.03482025
6 setosa 5.4 0.4 5.151501 0.11299956 0.21818386
7 setosa 4.6 0.3 5.057734 0.05951488 -0.47349794
8 setosa 5.0 0.2 4.963968 0.06020298 0.03482025
9 setosa 4.4 0.2 4.963968 0.06020298 -0.56517975
10 setosa 4.9 0.1 4.870201 0.11408783 0.04313845
# ... with 140 more rows
而且没有autoplot
方法:
# plot model diagnostics
bayes_diagnostic_plots_df <-
bayes_xy %>%
mutate(diagnostic_plots = map(model,
~autoplot(.,
which = 1:6,
ncol = 3,
label.size = 3)))
# Error, there doesn't seem to be an autoplot method for stan_lm output
shinystan::launch_shinystan(bayes_xy$model[[1]])
# This is quite interesting, but nothing at the level of specific data points
一些学术文献谈到了模型扰动、phi-divergence、Cook 后验模态距离和 Cook 后验平均距离、Kullback -Leibler 散度等方法。但是我看不到任何用 R 代码探索过的地方,我被困住了。
在 Cross-validated 上有一个关于这个主题的未回答的问题。我在这里发帖是因为我正在寻找关于编写代码来计算影响统计的想法(而不是关于统计理论和方法等的建议,应该针对其他问题)
我怎样才能从输出中得到类似库克距离测量值的东西rstanarm::stan_lm
?