To extend @Ilmari Karonen's solution, what you want to do is compute weights for each histogram and then sample according to those weights. Given your goal, it seems to me that the most efficient way to do this is with a linear program.
Let D_ij be the weight of bin j in the histogram of item i. Then, if each item has weight w_i, the "sum histogram" has bin weights (sum i in items) w_i D_ij. One way to get an "approximately uniform" distribution is to minimize the maximum difference between bins, so we would solve the following LP:
minimize z
subject to (for all j, k)
z >= (sum i in items) w_i D_ij - (sum i in items) w_i D_ik
z >= (sum i in items) w_i D_ik - (sum i in items) w_i D_ij
The above is basically saying that z must be at least the absolute value of the difference between every weighted pair of bins. To solve this LP you will need a separate package, since numpy does not include an LP solver. See this gist for a solution using cplex, or this gist for a solution using cvxpy. Note that you will need to set some constraints on the weights (e.g. each weight must be greater than or equal to 0), as these solutions do. Other Python bindings for GLPK (GNU Linear Programming Kit) can be found here: http://en.wikibooks.org/wiki/GLPK/Python.
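The LP above can also be sketched with scipy.optimize.linprog if you would rather avoid an extra solver package; the function name and the sum-to-one normalization on the weights are my own assumptions (without some normalization, w = 0 is trivially optimal):

```python
import numpy as np
from scipy.optimize import linprog


def uniform_mixture_weights(D):
    # D: (n_items, n_bins) array; D[i, j] is the weight of bin j
    # in the histogram of item i.
    n, m = D.shape
    # Variables: [w_1, ..., w_n, z]; objective: minimize z.
    c = np.zeros(n + 1)
    c[-1] = 1.0
    # Pairwise constraints: sum_i w_i (D_ij - D_ik) <= z for all bin pairs.
    rows = []
    for j in range(m):
        for k in range(m):
            if j != k:
                rows.append(np.concatenate([D[:, j] - D[:, k], [-1.0]]))
    A_ub = np.array(rows)
    b_ub = np.zeros(len(rows))
    # Assumed normalization: the weights sum to 1.
    A_eq = np.concatenate([np.ones(n), [0.0]])[None, :]
    b_eq = [1.0]
    # w_i >= 0; z is free.
    bounds = [(0, None)] * n + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:n]
```

For example, with two histograms concentrated on opposite bins, the optimum puts equal weight on each, making the weighted sum exactly uniform.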
Finally, you just sample from histogram i with weight w_i. This can be done with an adaptation of roulette selection using cumsum and searchsorted, as suggested by @Ilmari Karonen; see this gist.
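A minimal sketch of that two-stage roulette selection (the function name is hypothetical; it first picks a histogram proportionally to w, then picks a bin within it, both via cumsum + searchsorted):

```python
import numpy as np

rng = np.random.default_rng(0)


def sample(D, w, size=1):
    # D: (n_items, n_bins) histogram weights; w: per-histogram weights.
    # Stage 1: pick histogram i with probability proportional to w_i.
    # Stage 2: pick a bin within histogram i proportionally to its bin weights.
    cw = np.cumsum(np.asarray(w, dtype=float))
    out = np.empty(size, dtype=int)
    for t in range(size):
        i = np.searchsorted(cw, rng.random() * cw[-1])
        cb = np.cumsum(D[i])
        out[t] = np.searchsorted(cb, rng.random() * cb[-1])
    return out
```

Because cumsum builds the cumulative distribution and searchsorted does a binary search into it, each draw costs O(log n) after the O(n) setup.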
If you wanted the resulting weighted distribution to be "as uniform as possible", I would solve a similar problem with weights, but maximize the entropy of the weighted sum of bins. This problem is nonlinear, although you could use any number of nonlinear solvers such as BFGS or other gradient-based methods. This would probably be a bit slower than the LP method, but it depends on what you need in your application. With a large number of histograms, the LP method would approximate the nonlinear method very closely, because it would be easy to reach a uniform distribution.
When using the LP solution, a number of the histogram weights may bind to 0 when the number of constraints is small, but this will not be a problem with a non-trivial number of bins, since the number of constraints grows as O(n^2) in the number of bins.
Example weights with 50 histograms and 10 bins:
[0.006123642775837011, 0.08591660144140816, 0.0, 0.0, 0.0, 0.0, 0.03407525280610657, 0.0, 0.0, 0.0, 0.07092537493489116, 0.0, 0.0, 0.023926802333318554, 0.0, 0.03941537854267549, 0.0, 0.0, 0.0, 0.0, 0.10937063438351756, 0.08715770469631079, 0.0, 0.05841899435928017, 0.016328676622408153, 0.002218517959171183, 0.0, 0.0, 0.0, 0.08186919626269101, 0.03173286609277701, 0.08737065271898292, 0.0, 0.0, 0.041505225727435785, 0.05033635148761689, 0.0, 0.09172214842175723, 0.027548495513552738, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0259929997624099, 0.0, 0.0, 0.028044483157851748, 0.0, 0.0, 0.0]
With 50 histograms, each with 50 bins, there are now very few zero values:
[0.0219136051655165, 0.0, 0.028325808078797768, 0.0, 0.040889043180965624, 0.04372501089775975, 0.0, 0.031032870504105477, 0.020745831040881676, 0.04794861828714149, 0.0, 0.03763592540998652, 0.0029093177405377577, 0.0034239051136138398, 0.0, 0.03079554151573207, 0.0, 0.04676278554085836, 0.0461258666541918, 9.639105313353352e-05, 0.0, 0.013649362063473166, 0.059168272186891635, 0.06703936360466661, 0.0, 0.0, 0.03175895249795131, 0.0, 0.0, 0.04376133487616099, 0.02406633433758186, 0.009724226721798858, 0.05058252335384487, 0.0, 0.0393763638188805, 0.05287112817101315, 0.0, 0.0, 0.06365320629437914, 0.0, 0.024978299494456246, 0.023531082497830605, 0.033406648550332804, 0.012693750980220679, 0.00274892002684083, 0.0, 0.0, 0.0, 0.0, 0.04465971034045478, 4.888224154453002]