huffman-code - 从规范形式解码霍夫曼文件

Question

我正在编写一个霍夫曼文件，我将规范代码的代码长度存储在文件的标题中。在解码过程中，我能够重新生成规范代码并将它们存储到std::map<std:uint8_it, std::vector<bool>>. 实际数据被读入单个std::vector<bool>. 在有人建议我std::bitset使用std::vector<bool>. 那么，鉴于我有我的符号及其相应的规范代码，我该如何解码我的文件？我不知道从这里去哪里。有人可以向我解释我将如何解码这个文件，因为我在搜索时找不到任何与它相关的东西。

score 10 · Accepted Answer

您无需创建代码或树即可解码规范代码。您所需要的只是按顺序排列的符号列表以及每个代码长度中的符号数。“按顺序”是指按代码长度从最短到最长排序，并且在每个代码长度内，按符号值排序。

由于代码长度内的规范代码是顺序二进制整数，因此您可以简单地进行整数比较以查看您拥有的位是否在该代码范围内，如果是，则整数减法以确定它是哪个符号。

下面是puff.c中的代码（稍作更改），以明确显示这是如何完成的。 bits(s, 1)从流中返回下一位。（这假设总是有下一位。） h->count[len]是由长度len码编码的符号数，其中len是 in 0..MAXBITS。如果将h->count[1], h->count[2], ...,相加h->count[MAXBITS]，即为编码符号的总数，即为h->symbol[]数组的长度。中的第一个h->count[1]符号h->symbol[]长度为 1。接下来的h->count[2]符号h->symbol[]长度为 2。依此类推。

数组中的值h->count[]（如果正确）被限制为不会超额订阅可以以位编码的代码的可能数量len。可以进一步限制它表示一个完整的代码，即没有未定义的位序列，在这种情况下decode()不能返回错误（-1）。要使代码完整且不被超额订阅，总和h->count[len] << (MAXBITS - len)必须len等于1 << MAXBITS。

简单的例子：如果我们e用一个位、t两个位和a三个o位进行编码，那么h->count[]是{0, 1, 1, 2}（第一个值，h->count[0]未使用），并且h->symbol[]是{'e','t','a','o'}。然后是的代码e是，是，是，是0的代码。t10a110o111

#define MAXBITS 15              /* maximum bits in a code */

struct huffman {
    short *count;       /* number of symbols of each length */
    short *symbol;      /* canonically ordered symbols */
};

int decode(struct state *s, const struct huffman *h)
{
    int len;            /* current number of bits in code */
    int code;           /* len bits being decoded */
    int first;          /* first code of length len */
    int count;          /* number of codes of length len */
    int index;          /* index of first code of length len in symbol table */

    code = first = index = 0;
    for (len = 1; len <= MAXBITS; len++) {
        code |= bits(s, 1);             /* get next bit */
        count = h->count[len];
        if (code - count < first)       /* if length len, return symbol */
            return h->symbol[index + (code - first)];
        index += count;                 /* else update for next length */
        first += count;
        first <<= 1;
        code <<= 1;
    }
    return -1;                          /* ran out of codes */
}

score 1 · Accepted Answer

您的地图包含相关信息，但它将符号映射到代码。但是，您尝试解码的数据包含代码。因此，您的地图不能用于获取与以有效方式读取的代码相对应的符号，因为查找方法需要一个符号。搜索代码并检索相应的符号将是线性搜索。

相反，您应该重建为压缩步骤构建的 Huffman 树。内部节点的频率值在这里无关紧要，但您需要叶节点位于正确的位置。您可以在读取文件头时动态创建树。最初创建一棵空树。对于您阅读的每个符号到代码的映射，在树中创建相应的节点。例如，如果符号“D”已映射到代码 101，则确保根有一个右子节点，该节点有一个左子节点，该左子节点有一个包含符号“D”的右子节点，如果它们丢失，则创建节点。

使用该树，您可以按如下方式对流进行解码（伪代码，假设采用右孩子对应于在代码中添加 1）：

// use a node variable to remember the position in the tree while reading bits
node n = tree.root
while(stream not fully read) {
    read next bit into boolean b
    if (b == true) {
        n = n.rightChild
    } else {
        n = n.leftChild
    }
    // check whether we are in a leaf node now
    if (n.leftChild == null && n.rightChild == null) {
        // n is a leaf node, thus we have read a complete code
        // add the corresponding symbol to the decoded output
        decoded.add(n.getSymbol())
        // reset the search
        n = tree.root
    }
}

请注意，反转您的地图以使查找进入正确的方向仍然会导致性能欠佳（与二叉树遍历相比），因为它无法像遍历那样利用对较小搜索空间的限制。

huffman-code - 从规范形式解码霍夫曼文件

2 回答 2

Related

Reference