java - 在处理给定的一组数据时，如何为 zlib 'setDictionary' 找到一个好的/最佳字典？

Question

我有一组（大量）类似的数据文件。该集合不断增长。单个文件的大小约为 10K。每个文件都必须单独压缩。压缩是通过java.util.zip.Deflater类使用的 zlib 库完成的。使用将字典传递给 Deflate 算法setDictionary时，我可以提高压缩率。

有没有办法（算法）找到“最佳”字典，即具有整体最佳压缩比的字典？

参见zlib 手册

score 16 · Accepted Answer

John Reiser解释了 comp.compression：

对于字典：制作短子串的直方图，按收益（出现次数乘以压缩时保存的位数）排序，并将收益最高的子串放入字典中。例如，如果 k 是可以压缩的最短子串的长度（通常为 3==k 或 2==k），则制作所有长度为 k、1+k、2+k 的子串的直方图，并3+k。 当然，将这些子字符串放入字典中有一些技巧，利用子字符串、重叠、靠近高地址端的短字符串等。

Linux 内核使用类似的技术来压缩用于打印子例程调用堆栈的回溯的符号名称。请参阅文件 scripts/kallsyms.c。例如，https://code.woboq.org/linux/linux/scripts/kallsyms.c.html

zlib 手册建议将最常见的出现放在字典的末尾。

字典应该由稍后可能在要压缩的数据中遇到的字符串（字节序列）组成，最常用的字符串最好放在字典的末尾。当要压缩的数据很短并且可以很准确地预测时，使用字典是最有用的；然后可以比使用默认的空字典更好地压缩数据。

这是因为 LZ77 具有滑动窗口算法，因此在您的数据流上，后面的子字符串将比前几个更远。

我会尝试使用具有良好字符串支持的高级语言生成字典。一个粗略的 JavaScript 示例：

var str = "The dictionary should consist of strings (byte sequences) that"
    + " are likely to be encountered later in the data to be compressed,"
    + " with the most commonly used strings preferably put towards the "
    + "end of the dictionary. Using a dictionary is most useful when the"
    + " data to be compressed is short and can be predicted with good"
    + " accuracy; the data can then be compressed better than with the "
    + "default empty dictionary.";
// Extract words, remove punctuation (extra: replace(/\s/g, " "))
var words = str.replace(/[,\;.:\(\)]/g, "").split(" ").sort();
var  wcnt = [], w = "", cnt = 0; // pairs, current word, current word count
for (var i = 0, cnt = 0, w = ""; i < words.length; i++) {
    if (words[i] === w) {
        cnt++; // another match
    } else {
        if (w !== "")
            wcnt.push([cnt, w]); // Push a pair (count, word)
        cnt = 1; // Start counting for this word
        w = words[i]; // Start counting again
    }
}
if (w !== "")
    wcnt.push([cnt, w]); // Push last word
wcnt.sort(); // Greater matches at the end
for (var i in wcnt)
    wcnt[i] = wcnt[i][1]; // Just take the words
var dict = wcnt.join("").slice(-70); // Join the words, take last 70 chars

然后 dict 是一个由 70 个字符组成的字符串，其中包含：

rdsusedusefulwhencanismostofstringscompresseddatatowithdictionarybethe

您可以在这里尝试复制粘贴运行（添加：“print（dict）”）

那只是整个单词，而不是子字符串。还有一些方法可以重叠公共子字符串以节省字典空间。

java - 在处理给定的一组数据时，如何为 zlib 'setDictionary' 找到一个好的/最佳字典？

1 回答 1

Related

Reference