c++ - How do I represent binary numbers in C++ (for a Huffman encoder)?

Question

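I'm writing my own Huffman encoder. So far I have built the Huffman tree by using a min-heap to pop the two lowest-frequency nodes, create a node that links to them, and push the new node back onto the heap (lather, rinse, repeat until only one node is left).

So now I have the tree, but I need to use it to assign a code to each character. My problem is that I don't know how to store the binary representation of a number in C++. I remember reading that unsigned char is the standard type for a byte, but I'm not sure.

I know I have to traverse the tree recursively, and whenever I hit a leaf node I have to assign the corresponding character whatever code currently represents the path.

Here is what I have so far: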

void traverseFullTree(huffmanNode* root, unsigned char curCode, unsigned char &codeBook){

    if(root->leftChild == 0 && root->rightChild == 0){ //you are at a leaf node, assign curCode to root's character
        codeBook[(int)root->character] = curCode;
    }else{ //root has children, recurse into them with the currentCodes updated for right and left branch
        traverseFullTree(root->leftChild, **CURRENT CODE SHIFTED WITH A 0**, codeBook );
        traverseFullTree(root->rightChild, **CURRENT CODE SHIFTED WITH A 1**, codeBook);
    }

    return 0;
}

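codeBook is my array, which can hold the codes for up to 256 characters (one for every possible character in ASCII), but I will only actually assign codes to values that appear in the tree.

I'm not sure whether this is the right way to traverse my Huffman tree, but it is what immediately seems like it would work (though I haven't tested it). Also, how do I call the traversal function for the root of the whole tree, which has no zero or one (the very top of the tree)?

Should I use a string instead and append zeros or ones to it?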

score 1 · Accepted Answer

Since computers are binary... all numbers in C/C++ are already stored in binary.

int a = 10;

The variable a is a binary number.

What you want to look at is bit manipulation, using operators such as & | << >>.
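
To tie those operators back to the question: "the current code shifted with a 0 or a 1" is one shift followed by one OR. A minimal sketch, where `appendBit` is a hypothetical helper name that does not appear in the question:

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper: returns `code` with one more bit appended at the low end.
// Taking the left branch appends 0, taking the right branch appends 1.
std::uint32_t appendBit(std::uint32_t code, unsigned bit) {
    assert(bit <= 1u);            // only a single bit may be appended
    return (code << 1) | bit;     // make room for one bit, then set it
}
```

Note that leading zeros are not visible in the integer itself (the codes 1, 01 and 001 all have the value 1), so the number of bits in each code has to be stored alongside it.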

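With Huffman encoding, you pack the data into an array of bytes.

It's been a long time since I wrote any C, so this is "off the cuff" pseudocode...

Completely untested, but it should give you the right idea.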

unsigned char buffer[1000]; // This is the buffer we are writing to. Calculate the size ahead of time, or grow it dynamically as you go with malloc/realloc.

void set_bit(int bit_position) {
  int byte = bit_position / 8; // index of the byte that holds the bit
  int bit = bit_position % 8;  // position of the bit inside that byte

  // From http://stackoverflow.com/questions/47981/how-do-you-set-clear-and-toggle-a-single-bit-in-c
  buffer[byte] |= 1 << bit;
}

void clear_bit(int bit_position) {
  int byte = bit_position / 8; // index of the byte that holds the bit
  int bit = bit_position % 8;  // position of the bit inside that byte

  // From http://stackoverflow.com/questions/47981/how-do-you-set-clear-and-toggle-a-single-bit-in-c
  buffer[byte] &= ~(1 << bit);
}


// and in your loop, you'd just call these functions to set the bit number.
set_bit(0);
clear_bit(1);

score 0 · Accepted Answer

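Since curCode consists only of zeros and ones, a BitSet (std::bitset in the C++ standard library) may suit your needs. It is both convenient and memory-efficient. See: http://www.sgi.com/tech/stl/bitset.html

It takes only a small change to your code: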

void traverseFullTree(huffmanNode* root, unsigned char curCode, BitSet<N> &codeBook){

    if(root->leftChild == 0 && root->rightChild == 0){ //you are at a leaf node, assign curCode to root's character
        codeBook[(int)root->character] = curCode;
    }else{ //root has children, recurse into them with the currentCodes updated for right and left branch
        traverseFullTree(root->leftChild, **CURRENT CODE SHIFTED WITH A 0**, codeBook );
        traverseFullTree(root->rightChild, **CURRENT CODE SHIFTED WITH A 1**, codeBook);
    }

    return 0;
}

score 0 · Accepted Answer

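Please don't use strings.

You can represent the codebook as two arrays of integers, one holding the bit lengths of the codes and one holding the codes themselves. There is one problem with this: what if a code is longer than an integer? The solution is simply not to let that happen. For various reasons, keeping the maximum code length short (say, 15) is a trick used in most practical applications of Huffman coding.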
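
As a rough illustration of the two-array codebook, the traversal from the question could carry the code and its length side by side. This is only a sketch: the `huffmanNode` layout is assumed from the member names used in the question, and `assignCodes` is a hypothetical name.

```cpp
#include <cassert>
#include <cstdint>

// Assumed node layout, modeled on the member names used in the question.
struct huffmanNode {
    unsigned char character;
    huffmanNode* leftChild;
    huffmanNode* rightChild;
};

// Walks the tree and records the code bits and the code length of every leaf.
// `codes` and `lengths` must each have room for 256 entries.
void assignCodes(const huffmanNode* node, std::uint32_t code, int length,
                 std::uint32_t* codes, int* lengths) {
    if (node == nullptr) return;
    if (node->leftChild == nullptr && node->rightChild == nullptr) {
        codes[node->character] = code;      // leaf: store the path taken so far
        lengths[node->character] = length;  // and how many bits it has
        return;
    }
    assert(length < 32);  // a longer code would not fit in 32 bits
    assignCodes(node->leftChild, code << 1, length + 1, codes, lengths);          // append 0
    assignCodes(node->rightChild, (code << 1) | 1u, length + 1, codes, lengths);  // append 1
}
```

The root is simply visited with an empty code, for example `assignCodes(root, 0, 0, codes, lengths);`. A tree that consists of a single leaf ends up with a zero-length code and needs special handling.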

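I recommend using canonical Huffman codes, which simplifies your tree traversal slightly: you only need the lengths, so you don't have to keep track of the current code. With canonical Huffman codes, you can easily generate the codes from the lengths.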
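
Generating canonical codes from the lengths might look roughly like this. The sketch follows one common convention (the one described in RFC 1951 for DEFLATE): shorter codes come first numerically, and ties are broken by symbol order. The function name and the limit of 15 bits are assumptions made for illustration.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Assigns canonical codes from per-symbol code lengths (0 means "symbol unused").
std::vector<std::uint32_t> canonicalCodes(const std::vector<int>& lengths) {
    const int maxLen = 15;  // assumed length limit, as suggested in the answer
    std::vector<std::uint32_t> countPerLength(maxLen + 1, 0);
    for (int len : lengths) {
        assert(len >= 0 && len <= maxLen);
        if (len > 0) ++countPerLength[len];
    }
    // Compute the first (smallest) code of each length.
    std::vector<std::uint32_t> nextCode(maxLen + 1, 0);
    std::uint32_t code = 0;
    for (int len = 1; len <= maxLen; ++len) {
        code = (code + countPerLength[len - 1]) << 1;
        nextCode[len] = code;
    }
    // Hand out consecutive codes to the symbols of each length, in symbol order.
    std::vector<std::uint32_t> codes(lengths.size(), 0);
    for (std::size_t s = 0; s < lengths.size(); ++s) {
        if (lengths[s] > 0) codes[s] = nextCode[lengths[s]]++;
    }
    return codes;
}
```

For the lengths 3, 3, 3, 3, 3, 2, 4, 4 this yields the codes 010, 011, 100, 101, 110, 00, 1110, 1111. Because the decoder can rebuild the same codes from the lengths alone, only the lengths need to be stored with the compressed data.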

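If you use canonical codes, you can let the codes be wider than an integer, because the high bits are zero anyway. However, limiting the length is still a good idea. Having a short maximum length (not too short, since that would limit compression, but say around 16) lets you use the simplest table-based decoding method, a single-level table.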
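
To make the single-level table idea concrete: the table has one entry for every possible pattern of maxLen bits, and each code fills all the entries that start with it. The following is a sketch under the assumption that codes are read most significant bit first; `TableEntry` and `buildDecodeTable` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

struct TableEntry {
    std::uint16_t symbol;  // decoded symbol
    std::uint8_t length;   // number of bits to consume; 0 marks an unused slot
};

// Builds a single-level decoding table with 2^maxLen entries.
std::vector<TableEntry> buildDecodeTable(const std::vector<std::uint32_t>& codes,
                                         const std::vector<int>& lengths,
                                         int maxLen) {
    assert(codes.size() == lengths.size());
    assert(maxLen >= 1 && maxLen <= 16);
    std::vector<TableEntry> table(std::size_t(1) << maxLen, TableEntry{0, 0});
    for (std::size_t s = 0; s < codes.size(); ++s) {
        const int len = lengths[s];
        if (len == 0) continue;  // symbol does not occur
        assert(len <= maxLen);
        const int pad = maxLen - len;  // bits that follow the code in the window
        const std::size_t first = std::size_t(codes[s]) << pad;
        const std::size_t count = std::size_t(1) << pad;
        for (std::size_t i = 0; i < count; ++i) {
            table[first + i] = TableEntry{static_cast<std::uint16_t>(s),
                                          static_cast<std::uint8_t>(len)};
        }
    }
    return table;
}
```

Decoding then amounts to peeking at the next maxLen bits of input, looking that value up in the table, emitting `symbol`, and advancing the input by `length` bits. The table grows as 2^maxLen, which is one reason to keep the maximum code length small.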

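Limiting code lengths to 25 bits or fewer also simplifies encoding slightly: it lets you use a 32-bit integer as a "buffer" and empty it byte by byte, with no special handling for the case where the buffer holds fewer than 8 bits but encoding the current symbol would overflow it (because that case is avoided entirely: in the worst case there are 7 bits in the buffer and you try to encode a 25-bit symbol, which works just fine).

Something like this (not tested in any way):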

uint32_t buffer = 0;
int bufbits = 0;
for (int i = 0; i < symbolCount; i++)
{
    int s = symbols[i];
    buffer <<= lengths[s];  // make room for the bits
    bufbits += lengths[s];  // buffer got longer
    buffer |= values[s];    // put in the bits corresponding to the symbol

    while (bufbits >= 8)    // as long as there is at least a byte in the buffer
    {
        bufbits -= 8;       // forget it's there
        writeByte((buffer >> bufbits) & 0xFF); // and save it
    }
}
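
The loop above leaves up to 7 bits in the buffer after the last symbol. One way to handle that, sketched here as a self-contained function, is to pad the final byte with zero bits. `packBits` is a hypothetical name, and writing into a `std::vector` stands in for the `writeByte` call used above.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Packs (value, length) pairs into bytes, most significant bit first.
// The final byte is padded with zero bits if the stream does not end on a byte boundary.
std::vector<std::uint8_t> packBits(const std::vector<std::uint32_t>& values,
                                   const std::vector<int>& lengths) {
    assert(values.size() == lengths.size());
    std::vector<std::uint8_t> out;
    std::uint32_t buffer = 0;
    int bufbits = 0;
    for (std::size_t i = 0; i < values.size(); ++i) {
        assert(lengths[i] >= 1 && lengths[i] <= 25);   // at most 7 + 25 = 32 bits in flight
        buffer = (buffer << lengths[i]) | values[i];   // make room, then add the code bits
        bufbits += lengths[i];
        while (bufbits >= 8) {                         // emit every complete byte
            bufbits -= 8;
            out.push_back(static_cast<std::uint8_t>((buffer >> bufbits) & 0xFF));
        }
    }
    if (bufbits > 0) {                                 // flush the leftover bits, zero-padded
        out.push_back(static_cast<std::uint8_t>((buffer << (8 - bufbits)) & 0xFF));
    }
    return out;
}
```

Because of the padding, a decoder cannot tell from the bytes alone where the real data ends, so the symbol count (or an explicit end-of-stream symbol) has to be stored as well.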

score 0 · Accepted Answer

How to store the binary representation of a number in C++

You can simply use bitsets:

#include <iostream>
#include <bitset>

int main() {
  int a = 42;
  std::bitset<(sizeof(int) * 8)> bs(a);

  std::cout << bs.to_string() << "\n";
  std::cout << bs.to_ulong() << "\n";
  return (0);
}

As you can see, they also provide methods for converting to other types, as well as a convenient [] operator.
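
A short sketch of the [] operator in use, setting individual bits and reading the result back as a string. The function name `bitsetDemo` is made up for this example.

```cpp
#include <bitset>
#include <cassert>
#include <string>

// Builds an 8-bit pattern by setting individual bits, then returns it as text.
std::string bitsetDemo() {
    std::bitset<8> bits;   // all bits start at 0
    bits[0] = 1;           // operator[] indexes from the least significant bit
    bits[3] = 1;
    bits.flip(7);          // toggles bit 7 from 0 to 1
    assert(bits.count() == 3);
    return bits.to_string();   // most significant bit first: "10001001"
}
```

The size of a std::bitset is fixed at compile time, so for variable-length Huffman codes the length of each code still has to be stored separately.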
