我有一个数据集,看起来像来自正态分布的样本,有几个独立的影响移动特定数据点。应用于任何点的影响是已知的,需要从数据中恢复这些变化。我应该如何重新参数化这个模型以使其适合数据?
这是一个玩具数据集生成器:
import os
os.environ["MKL_THREADING_LAYER"] = "GNU"
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%config InlineBackend.figure_format='retina'
import seaborn as sns
import pymc3 as pm
import arviz as az
import theano.tensor as tt
import theano as T
np.random.seed(0)
numeff = 2
numdata = 10000
shifts = []
index = []
for i in range(numeff):
n = np.random.randint(2,4)
s = np.random.uniform(size=n)
s -= np.mean(s) #keep each effect shifts centered
shifts.append(s)
data = np.random.normal(0.0,0.03,size=numdata)
for s in shifts:
idx = np.random.randint(0,len(s),size=numdata)
index.append(idx)
data += s[idx]
plt.hist(data,100)
print(shifts)
[数组([-0.12571057,0.12571057]),数组([0.00673915,-0.11448923,0.10775008])]
我正在尝试使用 pymc3 对这些数据进行建模,如下所示:
coords = {}
for i in range(numeff):
name = "effect_{}".format(i)
value = np.arange(len(shifts[i]))
coords[name]=value
with pm.Model(coords=coords) as model:
idx = pm.Data("idx", np.array(index))
theta = tt.zeros(numdata)
for i in range(numeff):
dimname = "effect_{}".format(i)
mu = pm.Normal("mu_{}".format(i), mu=0.0, sigma=10.0,dims=(dimname))
mu -= mu.mean(keepdims=True)
theta += mu[idx[i,:]]
sigma = pm.Exponential('sigma',1.0)
obs = pm.Normal('obs',theta,sigma=sigma,observed=data)
pm.model_to_graphviz(model)
后采样步骤:
with model:
idata = pm.sample(draws=1000, tune=1000,chains=4,cores=4,return_inferencedata=True,target_accept=0.99)
我有一个非常缓慢且行为不端的采样器:
Auto-assigning NUTS sampler...
Initializing NUTS using jitter+adapt_diag...
Multiprocess sampling (4 chains in 4 jobs)
NUTS: [sigma, mu_1, mu_0]
100.00% [8000/8000 09:59<00:00 Sampling 4 chains, 0 divergences]
Sampling 4 chains for 1_000 tune and 1_000 draw iterations (4_000 + 4_000 draws total) took 599 seconds.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The chain reached the maximum tree depth. Increase max_treedepth, increase target_accept or reparameterize.
The rhat statistic is larger than 1.4 for some parameters. The sampler did not converge.
The estimated number of effective samples is smaller than 200 for some parameters.
az.plot_trace(idata)
有没有更好的方法来设计这样的模型?在一个理想的世界里,我应该有 3 种不同的影响来驱动数据……但是以我目前的模型设计,这是没有希望的。