algorithm - 表示连续概率分布

Question

我有一个涉及连续概率分布函数集合的问题，其中大部分是根据经验确定的（例如出发时间、运输时间）。我需要的是获取其中两个 PDF 并对它们进行算术运算的某种方法。例如，如果我有两个取自 PDF X 的值 x 和取自 PDF Y 的 y，我需要获取 (x+y) 或任何其他操作 f(x,y) 的 PDF。

分析解决方案是不可能的，所以我正在寻找的是允许这些事情的 PDF 的一些表示。一个明显（但计算成本高）的解决方案是蒙特卡罗：生成大量 x 和 y 值，然后仅测量 f(x, y)。但这需要太多的 CPU 时间。

我确实考虑将 PDF 表示为范围列表，其中每个范围具有大致相等的概率，有效地将 PDF 表示为均匀分布列表的并集。但我看不出如何将它们结合起来。

有人对这个问题有什么好的解决方案吗？

编辑：目标是创建一种用于操作 PDF 的迷你语言（又名域特定语言）。但首先我需要理清底层的表示和算法。

编辑 2： dmckee 建议使用直方图实现。这就是我对统一分布列表的理解。但我不知道如何将它们结合起来创建新的发行版。最终我需要找到像 P(x < y) 这样的东西，以防它可能非常小。

编辑 3：我有一堆直方图。它们不是均匀分布的，因为我是从发生数据生成它们的，所以基本上如果我有 100 个样本并且我想要直方图中的 10 个点，那么我为每个条分配 10 个样本，并使条的宽度可变但面积不变。

我已经发现要添加 PDF，您需要对它们进行卷积，并且我已经为此做好了数学准备。当你对两个均匀分布进行卷积时，你会得到一个包含三个部分的新分布：较宽的均匀分布仍然存在，但每边都有一个三角形，其宽度与较窄的分布相同。因此，如果我对 X 和 Y 的每个元素进行卷积，我会得到一堆这些，全部重叠。现在我试图弄清楚如何将它们全部相加，然后得到一个最接近它的直方图。

我开始怀疑蒙特卡洛到底是不是一个坏主意。

编辑 4： 本文详细讨论了均匀分布的卷积。一般来说，你会得到一个“梯形”分布。由于直方图中的每个“列”都是均匀分布，我希望可以通过对这些列进行卷积并对结果求和来解决问题。

然而，结果比输入复杂得多，并且还包括三角形。 编辑 5： [删除了错误的内容]。但是，如果这些梯形近似为具有相同面积的矩形，那么您会得到正确的答案，并且减少结果中的矩形数量看起来也很简单。这可能是我一直在寻找的解决方案。

编辑6：解决！这是这个问题的最终 Haskell 代码：

-- | Continuous distributions of scalars are represented as a
-- | histogram where each bar has approximately constant area but
-- | variable width and height.  A histogram with N bars is stored as
-- | a list of N+1 values.
data Continuous = C {
      cN :: Int,
      -- ^ Number of bars in the histogram.
      cAreas :: [Double],
      -- ^ Areas of the bars.  @length cAreas == cN@
      cBars :: [Double]
      -- ^ Boundaries of the bars.  @length cBars == cN + 1@
    } deriving (Show, Read)


{- | Add distributions.  If two random variables @vX@ and @vY@ are
taken from distributions @x@ and @y@ respectively then the
distribution of @(vX + vY)@ will be @(x .+. y).

This is implemented as the convolution of distributions x and y.
Each is a histogram, which is to say the sum of a collection of
uniform distributions (the "bars").  Therefore the convolution can be
computed as the sum of the convolutions of the cross product of the
components of x and y.

When you convolve two uniform distributions of unequal size you get a
trapezoidal distribution. Let p = p2-p1, q - q2-q1.  Then we get:


>   |                              |
>   |     ______                   |
>   |     |    |           with    |  _____________
>   |     |    |                   |  |           |
>   +-----+----+-------            +--+-----------+-
>         p1   p2                     q1          q2
> 
>  gives    h|....... _______________
>            |       /:             :\
>            |      / :             : \                1
>            |     /  :             :  \     where h = -
>            |    /   :             :   \              q
>            |   /    :             :    \
>            +--+-----+-------------+-----+-----
>             p1+q1  p2+q1       p1+q2   p2+q2

However we cannot keep the trapezoid in the final result because our
representation is restricted to uniform distributions.  So instead we
store a uniform approximation to the trapezoid with the same area:

>           h|......___________________
>            |     | /               \ |
>            |     |/                 \|
>            |     |                   |
>            |    /|                   |\
>            |   / |                   | \
>            +-----+-------------------+--------
>               p1+q1+p/2          p2+q2-p/2

-}
(.+.) :: Continuous -> Continuous -> Continuous
c .+. d = C {cN     = length bars - 1,
             cBars  = map fst bars, 
             cAreas = zipWith barArea bars (tail bars)}
    where
      -- The convolve function returns a list of two (x, deltaY) pairs.
      -- These can be sorted by x and then sequentially summed to get
      -- the new histogram.  The "b" parameter is the product of the
      -- height of the input bars, which was omitted from the diagrams
      -- above.
      convolve b c1 c2 d1 d2 =
          if (c2-c1) < (d2-d1) then convolve1 b c1 c2 d1 d2 else convolve1 b d1 
d2 c1 c2
      convolve1 b p1 p2 q1 q2 = 
          [(p1+q1+halfP, h), (p2+q2-halfP, (-h))]
               where 
                 halfP = (p2-p1)/2
                 h = b / (q2-q1)
      outline = map sumGroup $ groupBy ((==) `on` fst) $ sortBy (comparing fst) 
$ concat
                [convolve (areaC*areaD) c1 c2 d1 d2 |
                 (c1, c2, areaC) <- zip3 (cBars c) (tail $ cBars c) (cAreas c),
                 (d1, d2, areaD) <- zip3 (cBars d) (tail $ cBars d) (cAreas d)
                ]
      sumGroup pairs = (fst $ head pairs, sum $ map snd pairs)

      bars = tail $ scanl (\(_,y) (x2,dy) -> (x2, y+dy)) (0, 0) outline
      barArea (x1, h) (x2, _) = (x2 - x1) * h

其他运算符留给读者练习。

score 16 · Accepted Answer

不需要直方图或符号计算：如果采取正确的观点，一切都可以在语言级别以封闭形式完成。

[我将交替使用术语“度量”和“分布”。另外，我的 Haskell 生锈了，请您原谅我在这方面的不精确。]

概率分布实际上是codata。

设 mu 为概率测度。您可以对度量做的唯一事情是将其与测试函数集成（这是“度量”的一种可能的数学定义）。请注意，这是您最终将要做的：例如，针对身份进行整合就是取平均值：

mean :: Measure -> Double
mean mu = mu id

另一个例子：

variance :: Measure -> Double
variance mu = (mu $ \x -> x ^ 2) - (mean mu) ^ 2

另一个例子，它计算 P(mu < x)：

cdf :: Measure -> Double -> Double
cdf mu x = mu $ \z -> if z < x then 1 else 0

这表明了一种二元性方法。

因此类型Measure应表示类型(Double -> Double) -> Double。这允许您对 MC 模拟的结果、针对 PDF 的数值/符号求积等进行建模。例如，函数

empirical :: [Double] -> Measure
empirical h:t f = (f h) + empirical t f

返回 f 对通过例如获得的经验测量的积分。MC 采样。还

from_pdf :: (Double -> Double) -> Measure
from_pdf rho f = my_favorite_quadrature_method rho f

从（常规）密度构建度量。

现在，好消息。如果 mu 和 nu 是两个度量，则卷积 mu ** nu由下式给出：

(mu ** nu) f = nu $ \y -> (mu $ \x -> f $ x + y)

因此，给定两个度量，您可以针对它们的卷积进行积分。

此外，给定 law 的随机变量 X，mua * X 的定律由下式给出：

rescale :: Double -> Measure -> Measure
rescale a mu f = mu $ \x -> f(a * x)

此外，在我们的框架中，phi(X) 的分布由图像度量phi_* X 给出：

apply :: (Double -> Double) -> Measure -> Measure
apply phi mu f = mu $ f . phi

因此，现在您可以轻松地制定出用于度量的嵌入式语言。这里还有很多事情要做，特别是关于除实线之外的样本空间、随机变量之间的依赖关系、条件条件，但我希望你明白这一点。

特别是，前推是功能性的：

newtype Measure a = (a -> Double) -> Double
instance Functor Measure a where
    fmap f mu = apply f mu

它也是一个 monad（练习——提示：这很像 continuation monad。什么是return？什么是 ? 的类比call/cc）。

此外，结合微分几何框架，这可能会变成自动计算贝叶斯后验分布的东西。

在一天结束的时候，你可以写一些东西，比如

m = mean $ apply cos ((from_pdf gauss) ** (empirical data))

计算 cos(X + Y) 的平均值，其中 X 具有 pdfgauss且 Y 已通过 MC 方法进行采样，其结果在data.

score 6 · Accepted Answer

概率分布形成一个单子；参见Claire Jones 的工作以及 LICS 1989 年的论文，但这些想法可以追溯到 Giry 1982 年的一篇论文（DOI 10.1007/BFb0092872）和 Lawvere 1962 年的一篇我无法追踪的笔记（http://permalink. gmane.org/gmane.science.mathematics.categories/6541）。

但我没有看到共素：没有办法从“(a->Double)->Double”中得到“a”。也许如果你把它变成多态的 - (a->r)->r for all r？（那是延续单子。）

score 2 · Accepted Answer

我为我的论文研究了类似的问题。

计算近似卷积的一种方法是对密度函数（在这种情况下为直方图）进行傅里叶变换，将它们相乘，然后进行傅里叶逆变换得到卷积。

查看我论文的附录 C，了解有关概率分布的各种特殊操作情况的公式。您可以在以下网址找到该论文：http ://riso.sourceforge.net

我编写了 Java 代码来执行这些操作。您可以在以下位置找到代码：https ://sourceforge.net/projects/riso

score 2 · Accepted Answer

有什么阻止你为此使用迷你语言的吗？

我的意思是，定义一种语言，让您可以像书面一样为您编写f = x + y和评估。f同样对于g = x * z,h = y(x)等令人作呕。（我建议的语义要求评估者在评估时在 RHS 上出现的每个最里面的 PDF 上选择一个随机数，而不是试图理解生成的 PDF 的堆肥形式。这可能不够快.. .)

假设您了解所需的精度限制，您可以用直方图或样条曲线相当简单地表示 PDF（前者是后者的退化情况）。如果您需要将分析定义的 PDF 与实验确定的 PDF 混合，则必须添加类型机制。

直方图只是一个数组，其内容表示输入范围的特定区域中的发生率。你还没有说你是否有语言偏好，所以我假设一些类似 c 的东西。您需要知道 bin 结构（uniorm 大小很容易，但并不总是最好的），包括上限和下限以及可能的标准化：

struct histogram_struct {
  int bins; /* Assumed to be uniform */
  double low;
  double high;
  /* double normalization; */    
  /* double *errors; */ /* if using, intialize with enough space, 
                         * and store _squared_ errors
                         */
  double contents[];
};

这种事情在科学分析软件中很常见，你可能想使用现有的实现。

score 1 · Accepted Answer

I think the histograms or the list of 1/N area regions is a good idea. For the sake of argument, I'll assume that you'll have a fixed N for all distributions.

Use the paper you linked edit 4 to generate the new distribution. Then, approximate it with a new N-element distribution.

If you don't want N to be fixed, it's even easier. Take each convex polygon (trapezoid or triangle) in the new generated distribution and approximate it with a uniform distribution.

score 1 · Accepted Answer

Another suggestion is to use kernel densities. Especially if you use Gaussian kernels, then they can be relatively easy to work with... except that the distributions quickly explode in size without care. Depending on the application, there are additional approximation techniques like importance sampling that can be used.

score 1 · Accepted Answer

一些初步的想法：

首先，Mathematica 有一个很好的工具可以用精确的分布来做到这一点。

其次，表示为直方图（即经验 PDF）是有问题的，因为您必须选择 bin 大小。这可以通过存储累积分布来避免，即经验 CDF。（事实上，您可以保留重新创建经验分布所基于的完整样本数据集的能力。）

这是一些丑陋的 Mathematica 代码，用于获取样本列表并返回经验 CDF，即值概率对列表。通过 ListPlot 运行此输出以查看经验 CDF 的图。

empiricalCDF[t_] := Flatten[{{#[[2,1]],#[[1,2]]},#[[2]]}&/@Partition[Prepend[Transpose[{#[[1]], Rest[FoldList[Plus,0,#[[2]]]]/Length[t]}&[Transpose[{First[#],Length[#]}&/@ Split[Sort[t]]]]],{Null,0}],2,1],1]

最后，这里有一些关于组合离散概率分布的信息：

http://www.dartmouth.edu/~chance/teaching_aids/books_articles/probability_book/Chapter7.pdf

score 1 · Accepted Answer

自主移动机器人在定位和导航中处理类似的问题，特别是马尔可夫定位和卡尔曼滤波器（传感器融合）。例如，请参阅继续定位方法的实验比较。

您可以从移动机器人那里借鉴的另一种方法是使用势场进行路径规划。

score 1 · Accepted Answer

几个回应：

1）如果您已经根据经验确定了 PDF，则它们要么具有直方图，要么具有参数 PDF 的近似值。PDF 是一个连续函数，您没有无限数据...

2）假设变量是独立的。然后，如果您使 PDF 离散，则 P(f(x,y)) = f(x,y)p(x,y) = f(x,y)p(x)p(y) 对所有组合求和x 和 y 使得 f(x,y) 符合您的目标。

如果您要将经验 PDF 拟合到标准 PDF，例如正态分布，那么您可以使用已经确定的函数来计算总和等。

如果变量不是独立的，那么你手上的麻烦就会更多，我认为你必须使用copulas。

我认为定义自己的迷你语言等是过分的。你可以用数组来做到这一点......

score 0 · Accepted Answer

如果您想要一些乐趣，请尝试像 Maple 或 Mathemetica 那样象征性地表示它们。Maple 使用有向无环图，而 Matematica 使用类似 appoach 的列表/lisp（我相信，但这是一个很长的时间，因为我什至考虑过这一点）。

象征性地进行所有操作，然后最后推动数值。（或者只是找到一种在 shell 中启动并进行计算的方法）。

保罗。

algorithm - 表示连续概率分布

10 回答 10

Related

Reference