0

我有以下格式的整数序列:

Integer1 Integer2 Integer3 Integer4 Integer5 ....

每四个连续整数对应于单个记录的值。所以,我不能真正订购它们。

压缩此类文件的最佳方法是什么?

更新:

1-这些值彼此独立。每 4 个连续整数代表一条记录,例如:

CustomerId PurchaseId 产品 MoneySpent

每个都保存一个整数值。

2-理想情况下,我希望将其压缩为对象并保存在磁盘上。

谢谢

4

2 回答 2

0

最简单和最兼容的方法是在编写文件时通过 GZIP 压缩文件,方法是使用 GZIPOutputStream 包装您的流并使用 GZIPInputStream 包装它来读取它。

InputStream in = new BufferedInputStream(new GZIPInputStream(new FileInputStream(filename)));

OutputStream out = new BufferedOutputStream(new GZIPOutputStream(new FileOutputStream(filename)));
于 2013-01-24T14:27:26.910 回答
0

在给定的方式中,使用 GZip 并不是最优的。由于您的 OrderID、您的 PurcaseId、ProductID 和 MoneySpent 彼此不同,但所有 OrderId 与 PurcaseId、ProductId 和 MoneySpent 有一些共同点。因此,最好不要按行存储这些值,而是按列存储这些值。

由于您通常在要存储的此表中具有排序顺序,因此一列可以用增量值表示。例如,如果您按 OrderId 对值进行排序,则可以将 10、23、44、53 的序列表示为 +10、+13、+21、+53。这些数字比原始数字更小,更容易重复。

整数值可以表示为可变位长信息。首先,您存储值的位数而不是实际值。这样可以节省很多前导零。

For money spent you can also think about the actual repetition of typical numbers like 99, 25, 50, 49 and so on for the cent values. It is more likely that a product has the price of 49,99 but not 51,23. So spliting the money integer into two values will give you the ability to use Huffman encoding and treat special values as symbols and the rest as runlength bits.

To express the bit length, you can also use different encoding schemes one would be yet again a huffman code of 64 symbols (64 different length information) and train a coding schema. This way you will end up with very less numbers of bits instead of writing integers or even longs.

The remaining stuff can be put into gzip. This works usually better depending on the way you express the bit length since it is easier to compress leading zeros than different bit length information but every compression cost.

Another coding scheme for bit lengths is using the min max approach.

For example for the above sequence 10, 23, 44, 53 we store 10, +43 (53), +13, +23. The idea is to know that between 10 and 53 there are 43 elements. So the next value has a maximum length of 6 (2^6 = 64) bits. This way there is no need for bit length information. You just store the sequence in the oder first minimum, next maximum, next minimum, next maximum and so on.

A more efficient scheme is using minimum, maximum, middle, middle left, middle right, middle left left, middle left right, middle right left, middle right right ... . This way you have the best chance to result in smallest bit length knowledge. Using this way results in very small sizes of the integers without additional bit length information.

Using such schemes often leaves GZip a chance of further reduction by < 10% resulting in the omitting of GZip at all.

[Summary]

So GZip is simple, if you need to squeeze out more, go for column wise instead of row/entry wise. Use special knowledge of each column. If sorted use deltas as representation. Use bit lengths informations being expressed by huffman codes (one for each column) and using values for cents and dollars for product prices often result in very good compression chances. Store sorted columns by deltas and use the tree wise storage resulting in very good knowledge about the bit length to expect next.

于 2015-03-10T01:03:48.467 回答