python - Highest posterior density region and central credible region

Question

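Given a posterior p(Θ|D) over some parameters Θ, one can define the following:

Highest Posterior Density Region:

The Highest Posterior Density Region is the set of most probable values of Θ that, in total, constitute 100(1-α)% of the posterior mass.

In other words, for a given α, we look for a p* that satisfies:

$$1 - \alpha = \int_{\Theta : p(\Theta|D) > p^*} p(\Theta|D) \, d\Theta$$

Then, the Highest Posterior Density Region is obtained as the set:

$$C_\alpha(p^*) = \{\Theta : p(\Theta|D) > p^*\}$$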

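Central Credible Region:

Using the same notation as above, a Credible Region (or interval) is defined as:

$$C_\alpha(D) = (l, u) : P(l \le \Theta \le u \mid D) = 1 - \alpha$$

Depending on the distribution, there could be many such intervals. The central credible interval is defined as the credible interval with α/2 mass in each tail.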

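Computation:

For general distributions, given samples from the distribution, are there any built-ins to obtain the two quantities above in Python or PyMC?
For common parametric distributions (e.g. Beta, Gaussian, etc.), are there any built-ins or libraries to compute this using SciPy or statsmodels?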

score 23 · Accepted Answer

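From my understanding, the "central credible region" is not any different from how confidence intervals are calculated; all you need is the inverse of the cdf function at alpha/2 and 1-alpha/2. In scipy this is called ppf (percent point function). So, for a Gaussian posterior distribution: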

>>> from scipy.stats import norm
>>> alpha = .05
>>> l, u = norm.ppf(alpha / 2), norm.ppf(1 - alpha / 2)

To verify that [l, u] covers (1-alpha) of the posterior density:

>>> norm.cdf(u) - norm.cdf(l)
0.94999999999999996

Similarly, for a Beta posterior with, say, a=1 and b=3:

>>> from scipy.stats import beta
>>> l, u = beta.ppf(alpha / 2, a=1, b=3), beta.ppf(1 - alpha / 2, a=1, b=3)

and again:

>>> beta.cdf(u, a=1, b=3) - beta.cdf(l, a=1, b=3)
0.94999999999999996

The scipy.stats documentation lists the parametric distributions included in scipy; I guess all of them have a ppf function.
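As a minimal sketch of the recipe above for any frozen scipy distribution: the helper name `central_interval` is made up for this illustration, and the comparison relies on the `interval` method of scipy distributions, which is documented as returning the interval with equal areas around the median.

```python
from scipy.stats import beta, norm

def central_interval(dist, alpha=0.05):
    """Equal-tailed interval with alpha/2 mass in each tail (hypothetical helper)."""
    return dist.ppf(alpha / 2), dist.ppf(1 - alpha / 2)

gauss_l, gauss_u = central_interval(norm(loc=0, scale=1))
beta_l, beta_u = central_interval(beta(a=1, b=3))

# scipy's own equal-tailed interval should agree with the ppf-based one
print((gauss_l, gauss_u), norm(loc=0, scale=1).interval(0.95))
print((beta_l, beta_u), beta(a=1, b=3).interval(0.95))
```

Passing a frozen distribution keeps the helper independent of each distribution's parameter names.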

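As for the highest posterior density region, it is trickier, since the pdf function is not necessarily invertible, and in general such a region may not even be connected; for example, in the case of Beta with a = b = .5, whose density is U-shaped.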

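But in the case of the Gaussian distribution, it is easy to see that the "Highest Posterior Density Region" coincides with the "Central Credible Region"; and I think that is the case for all symmetric unimodal distributions (i.e. if the pdf function is symmetric around the mode of the distribution).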
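One way to check this numerically, offered as a sketch rather than a proof: for a continuous unimodal density, an HPD interval whose endpoints lie inside the support has the same density at both ends. Comparing the pdf at the two ends of the central interval therefore shows whether the two regions can coincide. The helper name `endpoint_densities` is hypothetical.

```python
from scipy.stats import beta, norm

alpha = 0.05

def endpoint_densities(dist, alpha):
    """Density at the two ends of the central interval (hypothetical helper)."""
    l, u = dist.ppf([alpha / 2, 1 - alpha / 2])
    return dist.pdf(l), dist.pdf(u)

# symmetric case: both ends sit at the same density level
norm_ends = endpoint_densities(norm(), alpha)
# skewed case: the densities differ, so the central interval is not the HPD
beta_ends = endpoint_densities(beta(a=1, b=3), alpha)

print(norm_ends)
print(beta_ends)
```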

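A possible numerical approach for the general case would be a binary search over the value of p*, using numerical integration of the pdf and utilizing the fact that the integral is a monotone function of p*.

Here is an example for a Gaussian mixture:

[ 1 ] The first thing you need is an analytical pdf function; for a Gaussian mixture that is easy: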

import numpy as np
from scipy.stats import norm

def mix_norm_pdf(x, loc, scale, weight):
    # weighted sum of the component densities evaluated at a scalar x
    return np.dot(weight, norm.pdf(x, loc, scale))

with, for example, location, scale and weight values as in

loc    = np.array([-1, 3])   # mean values
scale  = np.array([.5, .8])  # standard deviations
weight = np.array([.4, .6])  # mixture probabilities

you will get two nice Gaussian distributions holding hands:

[image: plot of the two-component Gaussian mixture pdf]

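[ 2 ] Now you need an error function which, given a test value for p*, integrates the pdf function above p* and returns the squared error from the desired value 1 - alpha: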

def errfn( p, alpha, *args):
    from scipy import integrate
    def fn( x ):
        pdf = mix_norm_pdf(x, *args)
        return pdf if pdf > p else 0

    # ideally integration limits should not
    # be hard coded but inferred
    lb, ub = -3, 6 
    prob = integrate.quad(fn, lb, ub)[0]
    return (prob + alpha - 1.0)**2

[ 3 ] Now, for a given value of alpha, we can minimize the error function to obtain p*:

alpha = .05

from scipy.optimize import fmin
p = fmin(errfn, x0=0, args=(alpha, loc, scale, weight))[0]

which results in p* = 0.0450, and the HPD as below; the red area represents 1 - alpha of the distribution, and the horizontal dashed line is p*.

[image: mixture pdf with the HPD region shaded in red and a horizontal dashed line at p*]
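The binary search over p* mentioned earlier in this answer can also be sketched directly, as an alternative to fmin. This is only an illustration under stated assumptions: the pdf is evaluated on a fixed grid (the limits -5 and 8 are hard-coded guesses that cover essentially all of the mass), the integral is a plain Riemann sum, and the helper names `mix_norm_pdf_grid`, `hpd_by_bisection` and `mask_to_intervals` are made up here.

```python
import numpy as np
from scipy.stats import norm

loc = np.array([-1, 3])       # mean values
scale = np.array([.5, .8])    # standard deviations
weight = np.array([.4, .6])   # mixture probabilities

def mix_norm_pdf_grid(x, loc, scale, weight):
    # broadcast the grid against the components, then sum the weighted densities
    return (weight * norm.pdf(x[:, None], loc, scale)).sum(axis=1)

def hpd_by_bisection(dens, dx, alpha, n_iter=60):
    """Bisect on p so that the mass of {x : pdf(x) > p} is about 1 - alpha."""
    def mass_above(p):
        return dens[dens > p].sum() * dx  # Riemann sum over the kept grid points
    lo, hi = 0.0, dens.max()
    for _ in range(n_iter):
        mid = 0.5 * (lo + hi)
        if mass_above(mid) > 1 - alpha:
            lo = mid   # region holds too much mass: raise the threshold
        else:
            hi = mid   # region holds too little mass: lower the threshold
    return 0.5 * (lo + hi)

def mask_to_intervals(x, mask):
    """Turn a boolean mask over the grid into a list of (start, end) intervals."""
    padded = np.concatenate(([False], mask, [False]))
    change = np.diff(padded.astype(int))
    starts = np.flatnonzero(change == 1)      # first grid index of each run
    ends = np.flatnonzero(change == -1) - 1   # last grid index of each run
    return [(x[s], x[e]) for s, e in zip(starts, ends)]

x = np.linspace(-5, 8, 20001)
dens = mix_norm_pdf_grid(x, loc, scale, weight)
p_star = hpd_by_bisection(dens, x[1] - x[0], alpha=0.05)
hpd_region = mask_to_intervals(x, dens > p_star)
print(p_star, hpd_region)
```

Because the region is read off a mask, a disconnected HPD comes out as several intervals without any special handling.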

score 16 · Accepted Answer

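To calculate the HPD you can leverage pymc3. Here is an example (this works with older pymc3 releases; as far as I know, later releases moved this functionality to ArviZ as arviz.hdi):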

import pymc3
from scipy.stats import norm
a = norm.rvs(size=10000)
pymc3.stats.hpd(a)

score 11 · Accepted Answer

Another option (adapted from R to Python, and taken from the book Doing Bayesian Data Analysis by John K. Kruschke) is the following:

from scipy.optimize import fmin
from scipy.stats import *

def HDIofICDF(dist_name, credMass=0.95, **args):
    # freeze distribution with given arguments
    distri = dist_name(**args)
    # initial guess for HDIlowTailPr
    incredMass =  1.0 - credMass

    def intervalWidth(lowTailPr):
        return distri.ppf(credMass + lowTailPr) - distri.ppf(lowTailPr)

    # find lowTailPr that minimizes intervalWidth
    HDIlowTailPr = fmin(intervalWidth, incredMass, ftol=1e-8, disp=False)[0]
    # return interval as array([low, high])
    return distri.ppf([HDIlowTailPr, credMass + HDIlowTailPr])

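The idea is to create a function intervalWidth that returns the width of the interval that starts at lowTailPr and has credMass mass. The minimum of the intervalWidth function is found by using the fmin minimizer from scipy.

For example, the result of:

print(HDIofICDF(norm, credMass=0.95, loc=0, scale=1))

is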

    [-1.95996398  1.95996398]

The names of the distribution parameters passed to HDIofICDF must be exactly the same as those used in scipy.
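A variant of the same idea, sketched here with a bounded scalar minimizer instead of fmin so that lowTailPr cannot leave the valid range [0, 1 - credMass]. This is a swapped-in technique rather than the code from the book; the function name `hdi_of_icdf_bounded` is hypothetical, and the approach assumes a unimodal distribution passed in as a frozen scipy object.

```python
from scipy.optimize import minimize_scalar
from scipy.stats import gamma, norm

def hdi_of_icdf_bounded(dist, cred_mass=0.95):
    """Shortest interval holding cred_mass for a frozen, unimodal scipy distribution."""
    def interval_width(low_tail_pr):
        return dist.ppf(cred_mass + low_tail_pr) - dist.ppf(low_tail_pr)

    # keep the lower tail probability inside [0, 1 - cred_mass]
    res = minimize_scalar(interval_width,
                          bounds=(0.0, 1.0 - cred_mass),
                          method="bounded",
                          options={"xatol": 1e-10})
    return dist.ppf(res.x), dist.ppf(cred_mass + res.x)

n_lo, n_hi = hdi_of_icdf_bounded(norm(loc=0, scale=1), 0.95)
g_lo, g_hi = hdi_of_icdf_bounded(gamma(a=3, scale=4), 0.90)
print(n_lo, n_hi)
print(g_lo, g_hi)
```

The gamma call uses the same shape, scale and coverage as the 90% gamma HDR shown in the stat.extend answer further down, so the two results can be compared.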

score 9 · Accepted Answer

PyMC has a built-in function for computing the HPD. In v2.3 it is in utils; see the source for details. As an example, here is a linear model and its HPD:

import pymc as pc  
import numpy as np
import matplotlib.pyplot as plt 
## data
np.random.seed(1)
x = np.array(range(0,50))
y = np.random.uniform(low=0.0, high=40.0, size=50)
y = 2*x+y
## plt.scatter(x,y)

## priors
emm = pc.Uniform('m', -100.0, 100.0, value=0)
cee = pc.Uniform('c', -100.0, 100.0, value=0) 

#linear-model
@pc.deterministic(plot=False)
def lin_mod(x=x, cee=cee, emm=emm):
    return emm*x + cee 

#likelihood
llhy = pc.Normal('y', mu=lin_mod, tau=1.0/(10.0**2), value=y, observed=True)

linearModel = pc.Model( [llhy, lin_mod, emm, cee] )
MCMClinear = pc.MCMC( linearModel)
MCMClinear.sample(10000,burn=5000,thin=5)
linear_output=MCMClinear.stats()

## pc.Matplot.plot(MCMClinear)
## print HPD using the trace of each parameter 
print(pc.utils.hpd(MCMClinear.trace('m')[:] , 1.- 0.95))
print(pc.utils.hpd(MCMClinear.trace('c')[:] , 1.- 0.95))

You may also consider calculating the quantiles:

print(linear_output['m']['quantiles'])
print(linear_output['c']['quantiles'])

I think if you just take the 2.5% to 97.5% values, you get your 95% central credible interval.
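Outside of PyMC, the same central interval can be read off any array of posterior samples with numpy percentiles. A minimal sketch, assuming the trace is a 1-D array; the helper name and the simulated `fake_trace` are stand-ins for illustration, not output of the model above.

```python
import numpy as np

def central_interval_from_samples(samples, alpha=0.05):
    """Equal-tailed interval from samples: alpha/2 of the draws in each tail."""
    lower, upper = np.percentile(samples, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lower, upper

rng = np.random.default_rng(0)
fake_trace = rng.normal(loc=2.0, scale=0.5, size=20000)  # stand-in for an MCMC trace
ci_low, ci_high = central_interval_from_samples(fake_trace)
print(ci_low, ci_high)
```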

score 7 · Accepted Answer

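I stumbled across this post while trying to find a way to estimate an HDI from an MCMC sample, but none of the answers worked for me. Like aloctavodia, I adapted an R example from the book Doing Bayesian Data Analysis to Python. I needed to compute a 95% HDI from an MCMC sample. Here is my solution: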

import numpy as np
def HDI_from_MCMC(posterior_samples, credible_mass):
    # Computes highest density interval from a sample of representative values,
    # estimated as the shortest credible interval
    # Takes Arguments posterior_samples (samples from posterior) and credible mass (normally .95)
    sorted_points = sorted(posterior_samples)
    ciIdxInc = np.ceil(credible_mass * len(sorted_points)).astype('int')
    nCIs = len(sorted_points) - ciIdxInc
    ciWidth = [0]*nCIs
    for i in range(0, nCIs):
        # width of the window that starts at index i and spans credible_mass of the points
        ciWidth[i] = sorted_points[i + ciIdxInc] - sorted_points[i]
    HDImin = sorted_points[ciWidth.index(min(ciWidth))]
    HDImax = sorted_points[ciWidth.index(min(ciWidth))+ciIdxInc]
    return(HDImin, HDImax)

The method above gives me sensible answers for the data I have!
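The same shortest-window search can be written without the Python loop. The following is a sketch of a vectorized variant, not the book's code; the name `hdi_from_samples` is hypothetical, it assumes a unimodal posterior, and it expects enough samples that the window is shorter than the whole array.

```python
import numpy as np

def hdi_from_samples(samples, credible_mass=0.95):
    """Shortest interval spanning credible_mass of the sorted samples."""
    s = np.sort(np.asarray(samples, dtype=float))
    n_in = int(np.ceil(credible_mass * s.size))   # index offset of each window
    widths = s[n_in:] - s[:s.size - n_in]         # width of every candidate window
    i = int(np.argmin(widths))
    return s[i], s[i + n_in]

rng = np.random.default_rng(1)
skewed = rng.exponential(size=20000)              # simulated, strongly skewed sample
hdi_low, hdi_high = hdi_from_samples(skewed, 0.95)
c_low, c_high = np.percentile(skewed, [2.5, 97.5])
print((hdi_low, hdi_high), (c_low, c_high))
```

For a skewed sample like this one, the HDI hugs the mode and is narrower than the central interval computed from the percentiles.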

score 2 · Accepted Answer

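You can get the central credible interval in two ways. Graphically: when you call summary_plot on variables in your model, there is an hpd flag that is set to True by default; changing it to False will draw central intervals. The second place you can get it is when you call the summary method on your model or a node; it will give you posterior quantiles, and the outer ones will be the 95% central interval by default (which you can change with the alpha argument).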

score 1 · Accepted Answer

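In `R` you can use the `stat.extend` package

If you are dealing with standard parametric distributions, and you don't mind using R, then you can use the HDR functions in the stat.extend package. This package has HDR functions for all the base distributions and for some of the distributions in extension packages. It computes the HDR using the quantile function of the distribution, and automatically adjusts for the shape of the distribution (e.g., unimodal, bimodal, etc.). Here are some examples of HDRs computed with this package for standard parametric distributions.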

#Load library
library(stat.extend)

#---------------------------------------------------------------
#Compute HDR for gamma distribution
HDR.gamma(cover.prob = 0.9, shape = 3, scale = 4)

        Highest Density Region (HDR) 
 
90.00% HDR for gamma distribution with shape = 3 and scale = 4 
Computed using nlm optimisation with 6 iterations (code = 1) 

[1.76530758147504, 21.9166988492762]

#---------------------------------------------------------------
#Compute HDR for (unimodal) beta distribution
HDR.beta(cover.prob = 0.9, shape1 = 3.2, shape2 = 3.0)

        Highest Density Region (HDR) 
 
90.00% HDR for beta distribution with shape1 = 3.2 and shape2 = 3 
Computed using nlm optimisation with 4 iterations (code = 1) 

[0.211049233508331, 0.823554556452285]

#---------------------------------------------------------------
#Compute HDR for (bimodal) beta distribution
HDR.beta(cover.prob = 0.9, shape1 = 0.3, shape2 = 0.4)

        Highest Density Region (HDR) 
 
90.00% HDR for beta distribution with shape1 = 0.3 and shape2 = 0.4 
Computed using nlm optimisation with 6 iterations (code = 1) 

[0, 0.434124342324438] U [0.640580807770818, 1]
