compression - Compressing a set of large integers

Question

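I have a set of integers for which I would like the most compact representation possible. I have the following constraints/features:

- it is a set, or in other words, a list of unique integers in which order does not matter
- the size of the set L is relatively small (typically 1000 elements)
- the integers follow a uniform distribution between 0 and N-1, with N relatively large (say 2^32)
- access to the elements of the compressed set is random, but it is fine if the decompression procedure is not that fast
- the compression should be lossless, obviously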

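I have tried a few things, but I am not satisfied with the results, and I am convinced that a better solution exists:

- delta encoding (sort, then encode the differences), or alternatively sort, then encode the difference between the i-th element and i*N/L. Both give reasonable results, but not great ones, probably because of the typical sizes of N and L. Huffman coding the deltas does not help because they are usually large.
- recursive range reduction (http://ygdes.com/ddj-3r/ddj-3r_compact.html). This seems clever, but it works best on exponentially decreasing integers, which is definitely not the case here.
- a few discussions here on stackoverflow are similar, but not quite equivalent, to my problem (C Library for compressing sequential positive integers, Compress sorted integers)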
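
For readers unfamiliar with the delta encoding mentioned in the first item, the following is a minimal C sketch of the idea, not the code used in the question. The names `cmp_u32`, `delta_encode` and `delta_decode` are hypothetical, and the sketch assumes the values fit in 32 bits and are distinct.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Comparison callback for qsort() on uint32_t values. */
static int cmp_u32(const void *a, const void *b)
{
    uint32_t x = *(const uint32_t *)a, y = *(const uint32_t *)b;
    return (x > y) - (x < y);
}

/* Sort set[0..n-1] in place, then replace every element after the first
   with the gap to its predecessor.  set[0] is kept as is.  The elements
   are assumed to be distinct, so every gap is at least one. */
static void delta_encode(uint32_t *set, size_t n)
{
    qsort(set, n, sizeof *set, cmp_u32);
    for (size_t i = n; i-- > 1; ) {     /* walk backwards: n-1 down to 1 */
        assert(set[i] > set[i - 1]);
        set[i] -= set[i - 1];
    }
}

/* Undo delta_encode(): a running sum restores the sorted values. */
static void delta_decode(uint32_t *set, size_t n)
{
    for (size_t i = 1; i < n; i++)
        set[i] += set[i - 1];
}
```

The gaps still have to be written with some variable-length or entropy code to save any space; the sketch only shows the transform itself.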

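I would be glad to hear any ideas you may have. Thanks in advance!

Update:

It turns out that delta encoding seems to come close to the optimal solution. This might be different for other distributions of the elements in the set.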

score 13 · Accepted Answer

You can get an idea of the best you can do by counting. (I wish stackoverflow allowed TeX equations like math.stackexchange. Anyway...)

ceiling(log(Combination(2^32,1000)) / (8 * log(2))) = 2934

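So if, as you say, the choices are uniformly distributed, then the best compression you can hope for on average in that particular case is 2934 bytes. The best ratio is 73.35% of the unencoded representation of 4000 bytes.

Combination(2^32,1000) is simply the total number of possible inputs to the compression algorithm. If those are uniformly distributed, then the optimal coding is one giant integer that identifies each possible input by an index. Each giant integer value uniquely identifies one of the inputs. Imagine looking up the input by index in a giant table. ceiling(log(Combination(2^32,1000)) / log(2)) is how many bits you need for that index integer.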
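
The count can also be checked with exact integer arithmetic. The sketch below is an illustration only, not part of the original answer: it builds Combination(n, k) in a small fixed-size big integer and measures how many bits the largest index needs. The names `big_mul`, `big_div`, `big_bits`, `index_bits` and the `LIMBS` capacity are hypothetical, and the code assumes n <= 2^32 and a result that fits comfortably in the fixed buffer.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define LIMBS 1024      /* 1024 * 32 bits, assumed ample for C(2^32, 1000) */

/* x *= m, for a little-endian big integer x[0..LIMBS-1] and m <= 2^32. */
static void big_mul(uint32_t *x, uint64_t m)
{
    uint64_t carry = 0;
    for (size_t i = 0; i < LIMBS; i++) {
        uint64_t t = (uint64_t)x[i] * m + carry;    /* cannot overflow */
        x[i] = (uint32_t)t;
        carry = t >> 32;
    }
}

/* x /= d, where the caller guarantees that the division is exact. */
static void big_div(uint32_t *x, uint32_t d)
{
    uint64_t rem = 0;
    for (size_t i = LIMBS; i-- > 0; ) {
        uint64_t t = (rem << 32) | x[i];
        x[i] = (uint32_t)(t / d);
        rem = t % d;
    }
}

/* Number of bits needed to write x (0 for x == 0). */
static unsigned long big_bits(const uint32_t *x)
{
    for (size_t i = LIMBS; i-- > 0; )
        if (x[i]) {
            unsigned long bits = 32UL * (unsigned long)i;
            for (uint32_t v = x[i]; v; v >>= 1)
                bits++;
            return bits;
        }
    return 0;
}

/* Bits needed for an index in 0..C(n,k)-1, i.e. ceiling(log2(C(n,k))). */
static unsigned long index_bits(uint64_t n, uint32_t k)
{
    uint32_t c[LIMBS] = {1};            /* c = 1 = C(n, 0) */

    assert(k <= n && n <= 4294967296ULL);
    for (uint32_t j = 1; j <= k; j++) {
        big_mul(c, n - j + 1);          /* c = C(n, j-1) * (n - j + 1) */
        big_div(c, j);                  /* c = C(n, j), always exact */
    }
    for (size_t i = 0; i < LIMBS; i++)  /* c -= 1: the largest index */
        if (c[i]-- != 0)
            break;
    return big_bits(c);
}
```

Rounding the result for n = 2^32 and k = 1000 up to whole bytes, as `(index_bits(4294967296ULL, 1000) + 7) / 8`, reproduces the 2934-byte figure above.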

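Update:

I found a way to get close to the theoretical best using off-the-shelf compression tools. I sort, apply delta coding, and subtract one from each delta (since the delta between successive distinct elements is at least one). Then the trick is that I write out all of the high bytes, followed by the next most significant bytes, and so on. The high bytes of the deltas minus one tend to be zero, so this groups a lot of zeros together, which the standard compression utilities love. The next set of bytes also tends to be biased toward low values.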
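
The byte reordering described above could be sketched in C as follows. This is a guess at the layout rather than the answer's actual code: the names `planes_encode` and `planes_decode` are hypothetical, and storing the first element as is (rather than as a delta) is an assumption, since the text does not say how it is handled. The output buffer would then be handed to gzip or xz.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Turn the ascending, duplicate-free values set[0..n-1] into 4*n bytes in
   out[].  Plane 0 holds the most significant byte of every "delta minus
   one", plane 1 the next byte, and so on.  Assumption: the first value is
   stored unchanged. */
static void planes_encode(const uint32_t *set, size_t n, unsigned char *out)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t d;

        if (i == 0)
            d = set[0];
        else {
            assert(set[i] > set[i - 1]);
            d = set[i] - set[i - 1] - 1;
        }
        for (int p = 0; p < 4; p++)
            out[(size_t)p * n + i] = (unsigned char)(d >> (24 - 8 * p));
    }
}

/* Inverse of planes_encode(): regroup the four bytes of each delta, then
   rebuild the values with a running sum. */
static void planes_decode(const unsigned char *in, size_t n, uint32_t *set)
{
    for (size_t i = 0; i < n; i++) {
        uint32_t d = 0;

        for (int p = 0; p < 4; p++)
            d = (d << 8) | in[(size_t)p * n + i];
        set[i] = i == 0 ? d : set[i - 1] + 1 + d;
    }
}
```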

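For the example (1000 uniform and distinct samples from 0..2^32-1), I get an average of 3110 bytes when running that through gzip -9, and 3098 bytes through xz -9 (xz uses the same compression, LZMA, as 7zip). Those are pretty close to the theoretical best average of 2934. Also, gzip has an overhead of 18 bytes and xz has an overhead of 24 bytes, in both cases for headers and trailers. So a fairer comparison with the theoretical best would be 3092 for gzip -9 and 3074 for xz -9, which is around 5% larger than the theoretical best.

Update 2:

I implemented direct encoding of the permutations and achieved an average of 2974 bytes, which is only a little over 1% more than the theoretical best. I used the GNU multiple precision arithmetic library to encode an index for each permutation in a giant integer. The actual code for the encoding and decoding is shown below. I added comments for the mpz_* functions where it might not be obvious from the name what arithmetic operations they are doing.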

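#include <assert.h>
#include <gmp.h>

/* 'local' is not defined in the posted code; it is assumed here to be the
   usual shorthand for file-scope static functions */
#define local static

/* Recursively code the members in set[] between low and high (low and high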
   themselves have already been coded).  First code the middle member 'mid'.
   Then recursively code the members between low and mid, and then between mid
   and high. */
local void combination_encode_between(mpz_t pack, mpz_t base,
                                      const unsigned long *set,
                                      int low, int high)
{
    int mid;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately (also in that case, verify that set[] is sorted
       in ascending order) */
    mid = (low + high) >> 1;
    if (mid == low) {
        assert(set[low] < set[high]);
        return;
    }

    /* code set[mid] into pack, and update base with the number of possible
       set[mid] values between set[low] and set[high] for the next coded
       member */
        /* pack += base * (set[mid] - set[low] - 1) */
    mpz_addmul_ui(pack, base, set[mid] - set[low] - 1);
        /* base *= set[high] - set[low] - 1 */
    mpz_mul_ui(base, base, set[high] - set[low] - 1);

    /* code the rest between low and high */
    combination_encode_between(pack, base, set, low, mid);
    combination_encode_between(pack, base, set, mid, high);
}

/* Encode the set of integers set[0..num-1], where each element is a unique
   integer in the range 0..max.  No value appears more than once in set[]
   (hence the name "set").  The elements of set[] must be sorted in ascending
   order. */
local void combination_encode(mpz_t pack, const unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t base;

    /* handle degenerate cases and verify last member <= max -- code set[0]
       into pack as simply itself and set base to the number of possible set[0]
       values for coding the next member */
    if (num < 1) {
            /* pack = 0 */
        mpz_set_ui(pack, 0);
        return;
    }
        /* pack = set[0] */
    mpz_set_ui(pack, set[0]);
    if (num < 2) {
        assert(set[0] <= max);
        return;
    }
    assert(set[num - 1] <= max);
        /* base = max - num + 2 */
    mpz_init_set_ui(base, max - num + 2);

    /* code the last member of the set and update base with the number of
       possible last member values */
        /* pack += base * (set[num - 1] - set[0] - 1) */
    mpz_addmul_ui(pack, base, set[num - 1] - set[0] - 1);
        /* base *= max - set[0] */
    mpz_mul_ui(base, base, max - set[0]);

    /* encode the members between 0 and num - 1 */
    combination_encode_between(pack, base, set, 0, num - 1);
    mpz_clear(base);
}

/* Recursively decode the members in set[] between low and high (low and high
   themselves have already been decoded).  First decode the middle member
   'mid'. Then recursively decode the members between low and mid, and then
   between mid and high. */
local void combination_decode_between(mpz_t unpack, unsigned long *set,
                                      int low, int high)
{
    int mid;
    unsigned long rem;

    /* compute the middle position -- if there is nothing between low and high,
       then return immediately */
    mid = (low + high) >> 1;
    if (mid == low)
        return;

    /* extract set[mid] as the remainder of dividing unpack by the number of
       possible set[mid] values, update unpack with the quotient */
        /* div = set[high] - set[low] - 1, rem = unpack % div, unpack /= div */
    rem = mpz_fdiv_q_ui(unpack, unpack, set[high] - set[low] - 1);
    set[mid] = set[low] + 1 + rem;

    /* decode the rest between low and high */
    combination_decode_between(unpack, set, low, mid);
    combination_decode_between(unpack, set, mid, high);
}

/* Decode from pack the set of integers encoded by combination_encode(),
   putting the result in set[0..num-1].  max must be the same value used when
   encoding. */
local void combination_decode(const mpz_t pack, unsigned long *set, int num,
                              unsigned long max)
{
    mpz_t unpack;
    unsigned long rem;

    /* handle degenerate cases, returning the value of pack as the only element
       for num == 1 */
    if (num < 1)
        return;
    if (num < 2) {
            /* set[0] = (unsigned long)pack */
        set[0] = mpz_get_ui(pack);
        return;
    }

    /* extract set[0] as the remainder after dividing pack by the number of
       possible set[0] values, set unpack to the quotient */
    mpz_init(unpack);
        /* div = max - num + 2, set[0] = pack % div, unpack = pack / div */
    set[0] = mpz_fdiv_q_ui(unpack, pack, max - num + 2);

    /* extract the last member as the remainder after dividing by the number
       of possible values, taking into account the first member -- update
       unpack with the quotient */
        /* div = max - set[0], rem = unpack % div, unpack /= div */
    rem = mpz_fdiv_q_ui(unpack, unpack, max - set[0]);
    set[num - 1] = set[0] + 1 + rem;

    /* decode the members between 0 and num - 1 */
    combination_decode_between(unpack, set, 0, num - 1);
    mpz_clear(unpack);
}
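
The pack/base arithmetic in the functions above is a form of mixed-radix coding: a value d that can take r different values is appended with pack += base * d followed by base *= r, and it is recovered with a remainder and a quotient. As an illustration only, here is the same idea with plain 64-bit integers instead of mpz_t. The names `mixed_put` and `mixed_get` are hypothetical, and the sketch assumes the product of all radices stays below 2^64, which is far too small for the real problem.

```c
#include <assert.h>
#include <stdint.h>

/* Append digit d (0 <= d < radix) to the mixed-radix number *pack.  *base is
   the product of the radices appended so far and must start at 1.
   Assumption: no overflow check is done, the caller keeps *base small. */
static void mixed_put(uint64_t *pack, uint64_t *base, uint64_t d,
                      uint64_t radix)
{
    assert(d < radix);
    *pack += *base * d;     /* same role as mpz_addmul_ui(pack, base, d) */
    *base *= radix;         /* same role as mpz_mul_ui(base, base, radix) */
}

/* Remove and return the next digit from *pack.  The radices must be given
   in the same order as when encoding. */
static uint64_t mixed_get(uint64_t *pack, uint64_t radix)
{
    uint64_t d = *pack % radix;

    *pack /= radix;         /* same role as mpz_fdiv_q_ui() */
    return d;
}
```

In the code above each radix is derived from members that have already been decoded (for example set[high] - set[low] - 1), which is why the decoder has to visit the members in exactly the same order as the encoder.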

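There are mpz_* functions for writing the number to a file and reading it back, or for exporting the number to a specified format in memory and importing it back.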

score 2 · Accepted Answer

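If the integers are random, uncorrelated, and really follow a uniform distribution over [0, 2³²-1[, then it can probably be shown that you cannot compress the array relative to the trivial representation. Am I missing something in your question?

For arrays of non-random numbers, I usually use a simple deflate. It is a commonly used algorithm because it works well for general arrays that are not totally random. The fact that good libraries with adjustable compression levels exist for all major languages is of course another advantage.

I use deflate to compress small arrays of physical sensor measurements (about 300 to 2000 32-bit integers) and get a gain of 70%, but that is because successive sensor measurements are rarely very different.

Finding a noticeably better algorithm that works in all situations is probably not easy. Most improvements will come from the specific properties of your number series.

You might also note that you will get a better compression gain by compressing multiple sets together. Of course, this may be very inconvenient depending on your application.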

score 2 · Accepted Answer

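Is this topic still open?

I am currently working on it.
(PS: I am a game creator, not a mathematician.)
I have not slept well for a few weeks because I keep wondering why we do not use an A^B+C variant (or something similar) to compress images and information.

My utopian goal is to compress a 4,600,000-digit number using an unlikely combination of A^B+C formulas found with the computer's GPU. Basically I am trying this because it would allow small images to be stored/streamed in under 100 characters, without loss of quality, at 30 fps over Wifi and without destroying the bandwidth.

My realistic goal is to compress a 200-digit number to <5 characters.

PS: For that purpose, I have already created "Base Chinais", if you want to use it:
- https://github.com/EloiStree/2019_09_19_MathCompressionOfImage/wiki/SouthChinais
- https://gitlab.com/eloitree/2019_09_06_UnicodeBasedId

Base(Chinais) 䶯 = 38727
It allows 2307^200+32450 to be written as 碸^灾+㔩.
If you try to compress a raw BigInteger, Base Chinais gives 4-4.5x compression:
1413546486463454579816416416416462324833676542
둲觋㷬乮䄠櫡䒤갱

So now I need to compress <200 digits into 9999^9999+99999999.
If you have any ideas or alternatives regarding A^B+C, feel free to let me know.
I have spent a lot of time experimenting through Unity3D.
I will post what I find on the subject here:
https://github.com/EloiStree/2019_09_19_MathCompressionOfImage/wiki

I hope it will help the next person who stumbles in here.

If you want to talk about it, find me on Discord.
https://eloitree.page.link/discord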
