algorithm - "Covering" the space of all possible histogram shapes

Question

There is a very expensive computation I must make frequently.

The computation takes a small array of numbers (with about 20 entries) that sums to 1 (i.e. the histogram) and outputs something that I can store pretty easily.

I have 2 things going for me:

I can accept approximate answers
The "answers" change slowly. For example: [.1 .1 .8 0] and [.1 .1 .75 .05] will yield similar results.

Consequently, I want to build a look-up table of answers off-line. Then, when the system is running, I can look-up an approximate answer based on the "shape" of the input histogram.

To be precise, I plan to look-up the precomputed answer that corresponds to the histogram with the minimum Earth-Mover-Distance to the actual input histogram.

I can only afford to store about 80 to 100 precomputed (histogram , computation result) pairs in my look up table.

So, how do I "spread out" my precomputed histograms so that, no matter what the input histogram is, I'll always have a precomputed result that is "close"?

score 3 · Accepted Answer

在M空间中找到作为最佳展开集的N个点或多或少等同于超球面包装( 1 , 2 )，并且通常对于M >10的答案是未知的。尽管已经进行了大量研究来开发更快的超球体填充或近似方法，但它仍然被认为是一个难题。

将诸如主成分分析或因子分析之类的技术应用于尽可能大的一组直方图可能会更好，因为您可以方便地生成。任一分析的结果将是一组M个数字，这样由这些数字加权的直方图数据元素的线性组合将预测一些目标函数。该功能可能是“您可以轻松存储的东西”数字，也可能是案例编号。还可以考虑开发和训练神经网络或使用其他预测建模技术来预测目标函数。

score 1 · Accepted Answer

Building on @jwpat7's answer, I would apply k-means clustering to a huge set of randomly generated (and hopefully representative) histograms. This would ensure that your space was spanned with whatever number of exemplars (precomputed results) you can support, with roughly equal weighting for each cluster.

The trick, of course, will be generating representative data to cluster in the first place. If you can recompute from time to time, you can recluster based on the actual data in the system so that your clusters might get better over time.

score 1 · Accepted Answer

我第二个 jwpat7 的答案，但我非常幼稚的方法是将每个直方图 bin 中的项目数视为一个y值，将这些值x视为仅20 步，0..1然后获取描述为三次函数的参数。abcx vs y

为了获得直方图的“覆盖”，我只是遍历了每个参数的“可能”值。

例如，为了获得 27 个直方图来覆盖我的三次直方图模型的“形状空间”，我通过迭代参数-1 .. 1，选择线性间隔的 3 个值。

历史记录3

现在，如果您认为您的数据通常会以这种方式表示，或者您认为最具描述性的任何模型，您可以将直方图模型更改为四次方，并生成要覆盖的直方图。我使用 27 是因为三个参数的每个参数的三个分区是3*3*3=27.

要获得更全面的覆盖，例如100，您必须更仔细地选择每个参数的范围。 100**.3不是整数，所以简单的num_covers**(1/num_params)解决方案不起作用，但对于 3 个参数4*5*5会。

由于参数的实际值可能会有很大差异，但仍能达到相同的形状，因此最好存储它们的比率以进行比较，例如，对于我的 3 个参数b/a和b/c.

这是一个使用四次模型的 81 直方图“覆盖”，同样具有从中选择的参数linspace(-1,1,3)：

大历史4

编辑：既然你说你的直方图是由大约 20 个元素的数组描述的，我认为拟合参数会非常快。

再次考虑edit2我认为在模型中使用常量是没有意义的，重要的是形状。

algorithm - "Covering" the space of all possible histogram shapes

3 回答 3

Related

Reference