algorithm - Byte-Pairing for data compression

Question

Question about byte-pairing for data compression. If byte-pairing converts two byte values into a single byte value, cutting the file in half, then taking a 1 GB file and recursing on it 16 times shrinks it to 62,500,000. My question is: is byte-pairing really efficient? Is creating a loop of, to be conservative, 5,000,000 iterations efficient? I would like some feedback and some incisive opinions, please.

Dave, what I read was:
"The US patent office no longer grants patents on perpetual motion machines, but has recently granted at least two patents on a mathematically impossible process: compression of truly random data."
I was not implying that the Patent Office was actually considering what I am asking about. I was merely commenting on the notion of a "mathematically impossible process." If someone has somehow created a method of using a single data byte as a placeholder for 8 individual bytes of data, that would be a candidate for a patent. As for the mathematical impossibility of an 8-to-1 compression method: it is not so much a mathematical impossibility as a matter of the rules and conditions that can be created. As long as data is stored on a medium in 8-bit or 16-bit units, there are ways to manipulate data that mirror current methods, or that could be created by a new way of thinking.

score 5 · Accepted Answer

In general, "recursive compression" as you've described it is a mirage: compression doesn't actually work that way.

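First, you should realize that every compression algorithm has the potential to expand its input file instead of compressing it. You can demonstrate this with a simple counting argument: the compressed version of any file must be different from the compressed version of every other file (otherwise you could not decompress it correctly). Also, for any file size N, there is a fixed number of possible files of size <= N. If any files of size > N are compressible to size <= N, then an equal number of files of size <= N must expand to size > N when "compressed".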
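The counting argument above can be checked with a few lines of arithmetic. This is only a minimal sketch that counts bit strings rather than real files; the function names are made up for illustration.

```python
def count_files_of_length(n_bits: int) -> int:
    """Number of distinct bit strings that are exactly n_bits long."""
    return 2 ** n_bits


def count_files_up_to(n_bits: int) -> int:
    """Number of distinct bit strings of length 0..n_bits inclusive."""
    # 2^0 + 2^1 + ... + 2^n = 2^(n+1) - 1
    return 2 ** (n_bits + 1) - 1


n = 8
longer = count_files_of_length(n + 1)  # every file of exactly n + 1 bits
shorter = count_files_up_to(n)         # every file of at most n bits

# A lossless compressor must map distinct inputs to distinct outputs, so it
# cannot send all of the longer files into the smaller pool of shorter ones.
print(longer, shorter, longer > shorter)
```

Whatever value of `n` is chosen, the pool of shorter outputs is one file too small to hold every longer input, so at least one input cannot shrink.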

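Second, "truly random" files are incompressible. Compression works because the compression algorithm expects to receive files with certain kinds of predictable regularities. "Truly random" files, however, are unpredictable by definition: every random file is just as likely as any other random file of the same length, so they don't compress.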

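Effectively, you have a model that treats some files as more likely than others; to compress such files, you want to choose shorter output files for the more likely input files. Information theory tells us that the most efficient way to compress files is to assign each input file of probability P an output file of length ~ log2(1/P) bits. This means that, ideally, every output file of a given length has roughly equal probability, just like "truly random" files.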

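Among completely random files of a given length, each file has probability (0.5)^(#original bits). The optimal length from above is ~ log2(1 / 0.5^(#original bits)) = (#original bits); that is, the original length is the best you can do.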
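The log2(1/P) rule is easy to evaluate directly. The following is a small sketch, with a hypothetical helper name and example probabilities chosen only for illustration.

```python
import math


def ideal_code_length_bits(probability: float) -> float:
    """Ideal output length, in bits, for an input file of the given probability."""
    return math.log2(1.0 / probability)


# A uniformly random file of n_bits bits has probability 0.5 ** n_bits,
# so its ideal code length comes out as n_bits: no savings are possible.
n_bits = 64
print(ideal_code_length_bits(0.5 ** n_bits))

# A file the model considers far more likely (assumed here: 1 chance in 1024)
# deserves a much shorter output.
print(ideal_code_length_bits(1 / 1024))
```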

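Because the output of a good compression algorithm is nearly random, recompressing a compressed file will gain you little to nothing. Any further improvement is effectively "leakage" due to suboptimal modeling and encoding; in addition, compression algorithms tend to scramble any regularity they don't take advantage of, which makes further compression of such "leakage" more difficult.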
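One way to see this for yourself is to feed a compressor its own output several times. The sketch below uses Python's standard `zlib` module rather than byte-pairing, and both inputs are invented for the experiment: text drawn from a small vocabulary, and seeded pseudo-random bytes standing in for a "truly random" file.

```python
import random
import zlib


def sizes_after_repeated_compression(data: bytes, rounds: int) -> list:
    """Return [original size, size after pass 1, size after pass 2, ...]."""
    sizes = [len(data)]
    for _ in range(rounds):
        data = zlib.compress(data, 9)  # feed each output back in as the next input
        sizes.append(len(data))
    return sizes


rng = random.Random(0)  # fixed seed so the run is repeatable

# Hypothetical "predictable" input: words drawn from a small vocabulary.
vocabulary = [
    "alpha", "bravo", "charlie", "delta", "echo", "foxtrot", "golf", "hotel",
    "india", "juliet", "kilo", "lima", "mike", "november", "oscar", "papa",
]
text = " ".join(rng.choice(vocabulary) for _ in range(50_000)).encode("ascii")

# Pseudo-random bytes of the same length, standing in for a "truly random" file.
noise = bytes(rng.getrandbits(8) for _ in range(len(text)))

text_sizes = sizes_after_repeated_compression(text, 4)
noise_sizes = sizes_after_repeated_compression(noise, 4)
print("text: ", text_sizes)
print("noise:", noise_sizes)
```

In a run like this, the first pass over the text should account for essentially all of the shrinkage, later passes should change the size very little, and the pseudo-random input should not shrink at all.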

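For a much longer exposition on this subject, with many examples of failed propositions of this sort, see the comp.compression FAQ. Claims of "recursive compression" feature prominently.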
