python-3.x - 为大型数据集创建 HMC 的自定义梯度函数

Question

我试图在张量流概率中使用 HMC 推断高斯过程的参数。

我有多个从同一底层进程生成的独立数据序列，我想推断它们都共享的内核参数。

为了计算我正在使用急切模式并遍历每个独立序列的可能性。我能够计算可能性，但是在尝试计算梯度时遇到了资源耗尽错误。

我知道这会很慢，但我希望能够将 HMC 与任何大小的数据集一起使用而不会耗尽内存。

我使用以下代码创建合成数据，这会从具有 p 个数据点的 GP 创建 N 个样本。

L = 5
variance=2
m_noise=0.05

kernel=psd_kernels.ExponentiatedQuadratic(np.float64(variance), np.float64(L))


def gram_matrix(xs):
    return kernel.matrix(xs,xs).numpy() + m_noise*np.identity(xs.shape[0])


observation_index_points = []
observations = []
N=200
p = 2000
for i in range(0, N):

    xs = np.sort(np.random.uniform(0,100,p))[...,None]
    mean = [0 for x in xs]
    gram = gram_matrix(xs)

    ys = np.random.multivariate_normal(mean, gram)

    observation_index_points.append(xs)

    observations.append(ys)

for i in range(0, N):

    plt.plot(observation_index_points[i],observations[i])
plt.show()

以下用于计算对数似然度的代码将使用采样器运行小 N 值，但会因 N 值较大（资源耗尽）而失败。尝试计算似然梯度时会发生错误。

@tf.function()
def gp_log_prob(amplitude, length_scale, seg_index_points, noise_variance, seg_observations):

    kernel = psd_kernels.ExponentiatedQuadratic(amplitude, length_scale)
    gp = tfd.GaussianProcess(kernel=kernel,
                                index_points=seg_index_points,
                                observation_noise_variance=noise_variance)


    return gp.log_prob(seg_observations)

rv_amplitude = tfd.LogNormal(np.float64(0.), np.float64(1))
rv_length_scale = tfd.LogNormal(np.float64(0.), np.float64(1))
rv_noise_variance = tfd.LogNormal(np.float64(0.), np.float64(1))

def joint_log_prob_no_grad(amplitude, length_scale, noise_variance):
    ret_val = rv_amplitude.log_prob(amplitude) \
                + rv_length_scale.log_prob(length_scale) \
                + rv_noise_variance.log_prob(noise_variance)

    for i in range(N):
        ret_val = ret_val + gp_log_prob(amplitude, 
                                        length_scale, 
                                        observation_index_points[i], 
                                        noise_variance, 
                                        observations[i])

    return ret_val

但是我可以在循环内使用梯度带计算大 N 的梯度。此代码对任何 N 运行并返回正确的似然性和梯度：

def joint_log_prob(amplitude, length_scale, noise_variance):

    with tf.GradientTape() as tape:
        tape.watch(amplitude)
        tape.watch(length_scale)
        tape.watch(noise_variance)

        ret_val = rv_amplitude.log_prob(amplitude) \
                    + rv_length_scale.log_prob(length_scale) \
                    + rv_noise_variance.log_prob(noise_variance)

    grads = tape.gradient(ret_val, [amplitude, length_scale, noise_variance])

    for i in range(N):
        with tf.GradientTape() as tape:
            tape.watch([amplitude, length_scale, noise_variance])
            gp_prob = gp_log_prob(amplitude, length_scale, 
                                  observation_index_points[i], noise_variance, observations[i])

        gp_grads = tape.gradient(gp_prob, [amplitude, length_scale, noise_variance])


        grads = [a+b for a,b in zip(grads,gp_grads)]
        ret_val = ret_val + gp_prob

    return ret_val, grads

x = tf.convert_to_tensor(np.float64(1.0))
y = tf.convert_to_tensor(np.float64(1.0))
z = tf.convert_to_tensor(np.float64(0.1))

joint_log_prob(x,y,z) # correct output even for large N

如果我然后把它变成一个 customgradient 它再次失败：

@tf.custom_gradient
def joint_log_prob_cg(amplitude, length_scale, noise_variance):

    with tf.GradientTape() as tape:
        tape.watch(amplitude)
        tape.watch(length_scale)
        tape.watch(noise_variance)

        ret_val = rv_amplitude.log_prob(amplitude) \
                    + rv_length_scale.log_prob(length_scale) \
                    + rv_noise_variance.log_prob(noise_variance)

    grads = tape.gradient(ret_val, [amplitude, length_scale, noise_variance])

    for i in range(N):
        with tf.GradientTape() as tape:
            tape.watch([amplitude, length_scale, noise_variance])
            gp_prob = gp_log_prob(amplitude, length_scale, 
                                  observation_index_points[i], noise_variance, observations[i])

        gp_grads = tape.gradient(gp_prob, [amplitude, length_scale, noise_variance])


        grads = [a+b for a,b in zip(grads,gp_grads)]
        ret_val = ret_val + gp_prob

    def grad(dy):
        return grads
    return ret_val, grad

with tf.GradientTape() as t:
    t.watch([x,y,z])
    lp = joint_log_prob_cg(x,y,z)

t.gradient(lp, [x,y,z]) # fails for large N

我的问题是如何grads从joint_log_prob上面的函数（我知道可以为任何大型数据集计算）到 HMC 采样器？似乎如果整个函数都包含在 gradienttape 调用中，那么for循环就会展开并且内存不足 - 但有没有办法解决这个问题？

score 0 · Accepted Answer

如果有人感兴趣，我可以通过使用自定义渐变并停止for循环中的磁带录制来解决这个问题。我需要导入磁带实用程序：

from tensorflow.python.eager import tape

然后停止for循环录制

with tape.stop_recording():
    for i in range(N):
        ...

这将停止跟踪，然后我必须在图形模式下计算梯度并手动添加它们以停止 OOM 错误。

python-3.x - 为大型数据集创建 HMC 的自定义梯度函数

1 回答 1

Related

Reference