algorithm - Best Compression algorithm for a sequence of integers

Question

I have a large array with a range of integers that are mostly continuous, eg 1-100, 110-160, etc. All integers are positive. What would be the best algorithm to compress this?

I tried the deflate algorithm but that gives me only 50% compression. Note that the algorithm cannot be lossy.

All numbers are unique and progressively increasing.

Also if you can point me to the java implementation of such algorithm that would be great.

score 82 · Accepted Answer

我们最近撰写了研究论文，调查了解决这个问题的最佳方案。请参见：

Daniel Lemire 和 Leonid Boytsov，通过矢量化每秒解码数十亿个整数，软件：实践与经验 45 (1)，2015。http ://arxiv.org/abs/1209.2137

Daniel Lemire、Nathan Kurz、Leonid Boytsov，SIMD 压缩和排序整数的交集，软件：实践和经验（待发表）http://arxiv.org/abs/1401.6399

它们包括广泛的实验评估。

您可以在线找到 C++11 中所有技术的完整实现： https://github.com/lemire/FastPFor和https://github.com/lemire/SIMDCompressionAndIntersection

还有 C 库：https ://github.com/lemire/simdcomp和https://github.com/lemire/MaskedVByte

如果您更喜欢 Java，请参阅https://github.com/lemire/JavaFastPFOR

score 38 · Accepted Answer

首先，通过获取每个值与前一个值之间的差异来预处理您的值列表（对于第一个值，假设前一个值为零）。在您的情况下，这应该主要给出一个序列，大多数压缩算法可以更容易地对其进行压缩。

这就是 PNG 格式改进其压缩的方式（它采用几种不同的方法之一，后跟 gzip 使用的相同压缩算法）。

score 18 · Accepted Answer

Well, i'm voting for smarter way. All you have to store is [int:startnumber][int/byte/whatever:number of iterations] in this case, you'll turn your example array into 4xInt value. After it you can compress as you want :)

score 17 · Accepted Answer

虽然您可以设计特定于您的数据流的自定义算法，但使用现成的编码算法可能更容易。我对 Java 中可用的压缩算法进行了一些测试，发现一百万个连续整数序列的压缩率如下：

None        1.0
Deflate     0.50
Filtered    0.34
BZip2       0.11
Lzma        0.06

score 12 · Accepted Answer

数字是多少？除了其他答案之外，您还可以考虑 base-128 变体长度编码，它允许您将较小的数字存储在单个字节中，同时仍然允许较大的数字。MSB 的意思是“还有另一个字节”——这在此处进行了描述。

将此与其他技术相结合，以便存储“skip size”、“take size”、“skip size”、“take size”——但请注意，“skip”和“take”都不会为零，所以我们将从每个中减去一个（这可以让您为少数几个值节省一个额外的字节）

所以：

1-100, 110-160

是“skip 1”（假设从零开始，因为它使事情变得更容易），“take 100”，“skip 9”，“take 51”；从每个中减去 1，给出（作为小数）

0,99,8,50

编码为（十六进制）：

00 63 08 32

如果我们想跳过/取一个更大的数字——例如 300；我们减去 1 得到 299 - 但这超过了 7 位；从小端开始，我们对 7 位块和一个 MSB 进行编码以指示继续：

299 = 100101100 = (in blocks of 7): 0000010 0101100

所以从小端开始：

1 0101100 (leading one since continuation)
0 0000010 (leading zero as no more)

给予：

AC 02

所以我们可以很容易地对大数字进行编码，但小数字（这听起来很典型的skip/take）占用的空间更少。

您可以尝试通过“放气”运行它，但它可能无济于事......

如果您不想自己处理所有那些乱七八糟的编码...如果您可以创建值的整数数组 (0,99,8,60) - 您可以使用带有打包重复 uint32/ 的协议缓冲区uint64 - 它会为你做所有的工作;-p

我不“做”Java，但这是一个完整的 C# 实现（从我的protobuf-net项目中借用一些编码位）：

using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
static class Program
{
    static void Main()
    {
        var data = new List<int>();
        data.AddRange(Enumerable.Range(1, 100));
        data.AddRange(Enumerable.Range(110, 51));
        int[] arr = data.ToArray(), arr2;

        using (MemoryStream ms = new MemoryStream())
        {
            Encode(ms, arr);
            ShowRaw(ms.GetBuffer(), (int)ms.Length);
            ms.Position = 0; // rewind to read it...
            arr2 = Decode(ms);
        }
    }
    static void ShowRaw(byte[] buffer, int len)
    {
        for (int i = 0; i < len; i++)
        {
            Console.Write(buffer[i].ToString("X2"));
        }
        Console.WriteLine();
    }
    static int[] Decode(Stream stream)
    {
        var list = new List<int>();
        uint skip, take;
        int last = 0;
        while (TryDecodeUInt32(stream, out skip)
            && TryDecodeUInt32(stream, out take))
        {
            last += (int)skip+1;
            for(uint i = 0 ; i <= take ; i++) {
                list.Add(last++);
            }
        }
        return list.ToArray();
    }
    static int Encode(Stream stream, int[] data)
    {
        if (data.Length == 0) return 0;
        byte[] buffer = new byte[10];
        int last = -1, len = 0;
        for (int i = 0; i < data.Length; )
        {
            int gap = data[i] - 2 - last, size = 0;
            while (++i < data.Length && data[i] == data[i - 1] + 1) size++;
            last = data[i - 1];
            len += EncodeUInt32((uint)gap, buffer, stream)
                + EncodeUInt32((uint)size, buffer, stream);
        }
        return len;
    }
    public static int EncodeUInt32(uint value, byte[] buffer, Stream stream)
    {
        int count = 0, index = 0;
        do
        {
            buffer[index++] = (byte)((value & 0x7F) | 0x80);
            value >>= 7;
            count++;
        } while (value != 0);
        buffer[index - 1] &= 0x7F;
        stream.Write(buffer, 0, count);
        return count;
    }
    public static bool TryDecodeUInt32(Stream source, out uint value)
    {
        int b = source.ReadByte();
        if (b < 0)
        {
            value = 0;
            return false;
        }

        if ((b & 0x80) == 0)
        {
            // single-byte
            value = (uint)b;
            return true;
        }

        int shift = 7;

        value = (uint)(b & 0x7F);
        bool keepGoing;
        int i = 0;
        do
        {
            b = source.ReadByte();
            if (b < 0) throw new EndOfStreamException();
            i++;
            keepGoing = (b & 0x80) != 0;
            value |= ((uint)(b & 0x7F)) << shift;
            shift += 7;
        } while (keepGoing && i < 4);
        if (keepGoing && i == 4)
        {
            throw new OverflowException();
        }
        return true;
    }
}

score 5 · Accepted Answer

TurboPFor：最快的整数压缩

用于 C/C++，包括 Java Critical Natives/JNI 接口
SIMD 加速整数压缩
用于已排序/未排序整数列表的标量 + 集成 (SIMD) 差分/之字形编码/解码
全范围 8/16/32/64 位整数列表
直接访问
基准应用

score 3 · Accepted Answer

compress the string "1-100, 110-160" or store the string in some binary representation and parse it to restore the array

score 3 · Accepted Answer

除了其他解决方案：

您可以找到“密集”区域并使用位图来存储它们。

例如：

如果在 1000-3000 之间的 400 个范围中有 1000 个数字，则可以使用一个位来表示数字的存在，并使用两个整数来表示范围。此范围的总存储空间为 2000 位 + 2 个整数，因此您可以将该信息存储在 254 字节中，这非常棒，因为即使是短整数也会占用两个字节，因此对于本示例，您可以节省 7 倍。

区域越密集，这个算法的效果就越好，但在某些时候，只存储开始和结束会更便宜。

score 3 · Accepted Answer

我会结合 CesarB 和 Fernando Miguélez 给出的答案。

首先，存储每个值与前一个值之间的差异。正如 CesarB 指出的那样，这将为您提供一系列主要为 1 的序列。

然后，对该序列使用运行长度编码压缩算法。由于大量重复值，它将非常好地压缩。

score 2 · Accepted Answer

我建议看一下Huffman Coding ，这是 Arithmetic Coding的一个特例。在这两种情况下，您都会分析您的起始序列以确定不同值的相对频率。与不太频繁出现的值相比，更频繁出现的值使用更少的位进行编码。

score 2 · Accepted Answer

我知道这是一个旧的消息线程，但我包括我在此处找到的 SKIP/TAKE 想法的个人 PHP 测试。我打电话给我的 STEP(+)/SPAN(-)。也许有人会觉得它有帮助。

注意：我实现了允许重复整数和负整数的能力，即使最初的问题涉及正的、非重复的整数。如果您想尝试减少一两个字节，请随意调整它。

代码：

  // $integers_array can contain any integers; no floating point, please. Duplicates okay.
  $integers_array = [118, 68, -9, 82, 67, -36, 15, 27, 26, 138, 45, 121, 72, 63, 73, -35,
                    68, 46, 37, -28, -12, 42, 101, 21, 35, 100, 44, 13, 125, 142, 36, 88,
                    113, -40, 40, -25, 116, -21, 123, -10, 43, 130, 7, 39, 69, 102, 24,
                    75, 64, 127, 109, 38, 41, -23, 21, -21, 101, 138, 51, 4, 93, -29, -13];

  // Order from least to greatest... This routine does NOT save original order of integers.
  sort($integers_array, SORT_NUMERIC); 

  // Start with the least value... NOTE: This removes the first value from the array.
  $start = $current = array_shift($integers_array);    

  // This caps the end of the array, so we can easily get the last step or span value.
  array_push($integers_array, $start - 1);

  // Create the compressed array...
  $compressed_array = [$start];
  foreach ($integers_array as $next_value) {
    // Range of $current to $next_value is our "skip" range. I call it a "step" instead.
    $step = $next_value - $current;
    if ($step == 1) {
        // Took a single step, wait to find the end of a series of seqential numbers.
        $current = $next_value;
    } else {
        // Range of $start to $current is our "take" range. I call it a "span" instead.
        $span = $current - $start;
        // If $span is positive, use "negative" to identify these as sequential numbers. 
        if ($span > 0) array_push($compressed_array, -$span);
        // If $step is positive, move forward. If $step is zero, the number is duplicate.
        if ($step >= 0) array_push($compressed_array, $step);
        // In any case, we are resetting our start of potentialy sequential numbers.
        $start = $current = $next_value;
    }
  }

  // OPTIONAL: The following code attempts to compress things further in a variety of ways.

  // A quick check to see what pack size we can use.
  $largest_integer = max(max($compressed_array),-min($compressed_array));
  if ($largest_integer < pow(2,7)) $pack_size = 'c';
  elseif ($largest_integer < pow(2,15)) $pack_size = 's';
  elseif ($largest_integer < pow(2,31)) $pack_size = 'l';
  elseif ($largest_integer < pow(2,63)) $pack_size = 'q';
  else die('Too freaking large, try something else!');

  // NOTE: I did not implement the MSB feature mentioned by Marc Gravell.
  // I'm just pre-pending the $pack_size as the first byte, so I know how to unpack it.
  $packed_string = $pack_size;

  // Save compressed array to compressed string and binary packed string.
  $compressed_string = '';
  foreach ($compressed_array as $value) {
      $compressed_string .= ($value < 0) ? $value : '+'.$value;
      $packed_string .= pack($pack_size, $value);
  }

  // We can possibly compress it more with gzip if there are lots of similar values.      
  $gz_string = gzcompress($packed_string);

  // These were all just size tests I left in for you.
  $base64_string = base64_encode($packed_string);
  $gz64_string = base64_encode($gz_string);
  $compressed_string = trim($compressed_string,'+');  // Don't need leading '+'.
  echo "<hr>\nOriginal Array has "
    .count($integers_array)
    .' elements: {not showing, since I modified the original array directly}';
  echo "<br>\nCompressed Array has "
    .count($compressed_array).' elements: '
    .implode(', ',$compressed_array);
  echo "<br>\nCompressed String has "
    .strlen($compressed_string).' characters: '
    .$compressed_string;
  echo "<br>\nPacked String has "
    .strlen($packed_string).' (some probably not printable) characters: '
    .$packed_string;
  echo "<br>\nBase64 String has "
    .strlen($base64_string).' (all printable) characters: '
    .$base64_string;
  echo "<br>\nGZipped String has "
    .strlen($gz_string).' (some probably not printable) characters: '
    .$gz_string;
  echo "<br>\nBase64 of GZipped String has "
    .strlen($gz64_string).' (all printable) characters: '
    .$gz64_string;

  // NOTICE: The following code reverses the process, starting form the $compressed_array.

  // The first value is always the starting value.
  $current_value = array_shift($compressed_array);
  $uncompressed_array = [$current_value];
  foreach ($compressed_array as $val) {
    if ($val < -1) {
      // For ranges that span more than two values, we have to fill in the values.
      $range = range($current_value + 1, $current_value - $val - 1);
      $uncompressed_array = array_merge($uncompressed_array, $range);
    }
    // Add the step value to the $current_value
    $current_value += abs($val); 
    // Add the newly-determined $current_value to our list. If $val==0, it is a repeat!
    array_push($uncompressed_array, $current_value);      
  }

  // Display the uncompressed array.
  echo "<hr>Reconstituted Array has "
    .count($uncompressed_array).' elements: '
    .implode(', ',$uncompressed_array).
    '<hr>';

输出：

--------------------------------------------------------------------------------
Original Array has 63 elements: {not showing, since I modified the original array directly}
Compressed Array has 53 elements: -40, 4, -1, 6, -1, 3, 2, 2, 0, 8, -1, 2, -1, 13, 3, 6, 2, 6, 0, 3, 2, -1, 8, -11, 5, 12, -1, 3, -1, 0, -1, 3, -1, 2, 7, 6, 5, 7, -1, 0, -1, 7, 4, 3, 2, 3, 2, 2, 2, 3, 8, 0, 4
Compressed String has 110 characters: -40+4-1+6-1+3+2+2+0+8-1+2-1+13+3+6+2+6+0+3+2-1+8-11+5+12-1+3-1+0-1+3-1+2+7+6+5+7-1+0-1+7+4+3+2+3+2+2+2+3+8+0+4
Packed String has 54 (some probably not printable) characters: cØÿÿÿÿ ÿõ ÿÿÿÿÿÿ
Base64 String has 72 (all printable) characters: Y9gE/wb/AwICAAj/Av8NAwYCBgADAv8I9QUM/wP/AP8D/wIHBgUH/wD/BwQDAgMCAgIDCAAE
GZipped String has 53 (some probably not printable) characters: xœ Ê» ÑÈÎ¯€)YšE¨MŠ“^qçºR¬m&Òõ‹%Ê&TFÊ‰ùÀ6ÿÁÁ Æ
Base64 of GZipped String has 72 (all printable) characters: eJwNyrsNACAMA9HIzq+AKVmaRahNipNecee6UgSsBW0m0gj1iyXKJlRGjcqJ+cA2/8HBDcY=
--------------------------------------------------------------------------------
Reconstituted Array has 63 elements: -40, -36, -35, -29, -28, -25, -23, -21, -21, -13, -12, -10, -9, 4, 7, 13, 15, 21, 21, 24, 26, 27, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 51, 63, 64, 67, 68, 68, 69, 72, 73, 75, 82, 88, 93, 100, 101, 101, 102, 109, 113, 116, 118, 121, 123, 125, 127, 130, 138, 138, 142
--------------------------------------------------------------------------------

score 1 · Accepted Answer

Your case is very similar to compression of indices in search engines. The popular compression algorithm used is the PForDelta algorithm and Simple16 algorithm. You can use the kamikaze library for your compression needs.

score 1 · Accepted Answer

您可能应该使用的基本思想是，对于每个连续整数范围（我将称之为这些范围），存储起始数字和范围的大小。例如，如果您有一个包含 1000 个整数的列表，但只有 10 个单独的范围，则您可以仅存储 20 个整数（每个范围有 1 个起始编号和 1 个大小）来表示此数据，压缩率为 98 %。幸运的是，您可以进行更多优化，这将有助于处理范围数量较大的情况。

存储起始编号相对于前一个起始编号的偏移量，而不是起始编号本身。这里的好处是您存储的数字通常需要更少的位（这可能会在以后的优化建议中派上用场）。此外，如果您只存储起始数字，这些数字都将是唯一的，而存储偏移量可以使数字接近甚至重复，这可能允许进一步压缩，之后应用另一种方法。
使用尽可能少的位数对于这两种类型的整数。您可以遍历数字以获得起始整数的最大偏移量以及最大范围的大小。然后，您可以使用最有效地存储这些整数的数据类型，并在压缩数据的开头简单地指定数据类型或位数。例如，如果起始整数的最大偏移量仅为 12,000，最大范围为 9,000 长，那么您可以对所有这些使用 2 字节无符号整数。然后，您可以在压缩数据的开头填充对 2,2 以显示两个整数都使用了 2 个字节。当然，您可以使用一点位操作将这些信息放入单个字节中。

通过这两个优化，让我们看几个例子（每个都是 4,000 字节）：

1,000 个整数，最大偏移量为 500，10 个范围
1,000 个整数，最大偏移量为 100，50 个范围
1,000 个整数，最大偏移量为 50，100 个范围

没有优化

20 个整数，每个 4 个字节 = 80 个字节。压缩 = 98%
100 个整数，每个 4 个字节 = 400 个字节。压缩 = 90%
200 个整数，每个 4 个字节 = 800 个字节。压缩 = 80%

通过优化

1 字节头 + 20 个数字，每个 1 字节 = 21 字节。压缩 = 99.475%
1 字节头 + 100 个数字，每个 1 字节 = 101 字节。压缩 = 97.475%
1 字节头 + 200 个数字，每个 1 字节 = 201 字节。压缩 = 94.975%

score 1 · Accepted Answer

为此，我无法让我的压缩比大约 0.11 好得多。我通过python解释器生成了我的测试数据，它是一个以换行符分隔的1-100和110-160整数列表。我使用实际程序作为数据的压缩表示。我的压缩文件如下：

main=mapM_ print [x|x<-[1..160],x`notElem`[101..109]]

它只是一个 Haskell 脚本，可以生成您可以通过以下方式运行的文件：

$ runhaskell generator.hs >> data

g.hs文件大小为54字节，python生成的数据为496字节。这给出了 0.10887096774193548 作为压缩比。我认为有更多的时间可以缩小程序，或者你可以压缩压缩文件（即haskell 文件）。

另一种方法可能是保存 4 个字节的数据。每个序列的最小值和最大值，然后将它们提供给生成函数。尽管如此，文件的加载会给解压缩器添加更多字符，从而为解压缩器增加更多复杂性和更多字节。同样，我通过一个程序表示了这个非常具体的序列，它没有概括，它是一种特定于数据的压缩。此外，增加通用性会使解压缩器更大。

另一个问题是必须有 Haskell 解释器才能运行它。当我编译程序时，它变得更大了。我真的不知道为什么。python也有同样的问题，所以也许最好的方法是给出范围，以便某个程序可以解压缩文件。

score 0 · Accepted Answer

If you have series of repeated values RLE is the easiest to implement and could give you a good result. Nontheless other more advanced algorithms that take into account the entrophy such as LZW, which is now patent-free, can usually achive a much better compression.

You can take a look at these and other lossless algorithms here.

algorithm - Best Compression algorithm for a sequence of integers

15 回答 15

Related

Reference