c - LZW Compression with Entire unicode library

Question

I am trying to do this problem:

Assume we have an initial alphabet of the entire Unicode character set, instead of just all the possible byte values. Recall that unicode characters are unsigned 2-byte values, so this means that each 2 bytes of uncompressed data will be treated as one symbol, and we'll have an alphabet with over 60,000 symbols. (Treating symbols as 2-byte Unicodes, rather than a byte at a time, makes for better compression in the case of internationalized text.) And, note, there's nothing that limits the number of bits per code to at most 16. As you generalize the LZW algorithm for this very large alphabet, don't worry if you have some pretty long codes.

With this, give the compressed version of this four-symbol sequence, using our project assumptions, including an EOD code, and grouping into 4-byte ints. (These three symbols are Unicode values, represented numerically.) Write your answer as 3 8-digit hex values, space separated, using capital hex digits, not lowercase.

32767 32768 32767 32768

The problem I am having is that I don't know the full range of the alphabet, so when doing the LZW compression I don't know what numeric values the new codes will have. For the same reason, I also don't know what the EOD code will be.

Also, it seems to me that the compressed data will only take two integers.

score 2 · Accepted Answer

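The problem statement is ill-formed.

In Unicode as we know it today, code points (the numbers that represent characters, composable parts of characters, and other useful but more obscure things) cannot all be numbered from 0 to 65535 to fit into 16 bits. There are more than 100,000 Chinese, Japanese, and Korean characters in Unicode, so you would need 17 or more bits just for those. Unicode is therefore clearly not the right choice here.

On the other hand, there is a sort of "abridged" version of Unicode, the Universal Character Set, whose UCS-2 encoding uses 16-bit code points and can technically represent at most 65536 characters and the like. Characters with codes greater than 65535 are out of luck: UCS-2 cannot represent them.

So, if it really is UCS-2, you can download its specification (ISO/IEC 10646, I believe) and work out exactly which of those 64K codes are in use and should therefore form your initial LZW alphabet.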
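If the exercise instead intends every 16-bit value to count as a symbol, the code numbering can be worked out mechanically. The following is only a sketch under assumptions the question does not confirm: the alphabet is all 65536 values (codes 0 to 65535), EOD is the next code (65536), and learned strings are numbered from 65537 upward. The names `lzw_encode16`, `EOD_CODE`, `FIRST_NEW` and `MAX_NEW` are made up for illustration, and the bit packing into 4-byte ints is left out because it depends on the project's conventions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define ALPHABET_SIZE 65536u           /* assumption: every 16-bit value is a symbol */
#define EOD_CODE      ALPHABET_SIZE    /* assumption: EOD is the first code after the alphabet */
#define FIRST_NEW     (EOD_CODE + 1u)  /* assumption: learned strings start right after EOD */
#define MAX_NEW       1024u            /* small demo table; no more entries are learned once full */

struct entry {
    uint32_t prefix;   /* code of the string being extended */
    uint16_t symbol;   /* 16-bit symbol appended to that string */
};

/* Encode n 16-bit symbols into LZW codes (not yet bit-packed).
   Returns the number of codes written, including the trailing EOD.
   out must have room for n + 1 codes. */
size_t lzw_encode16(const uint16_t *in, size_t n, uint32_t *out)
{
    static struct entry table[MAX_NEW];
    size_t used = 0;    /* number of learned strings so far */
    size_t count = 0;   /* number of codes emitted so far */

    assert(out != NULL);
    if (n == 0) {
        out[count++] = EOD_CODE;
        return count;
    }

    uint32_t w = in[0]; /* code of the current longest match */
    for (size_t i = 1; i < n; i++) {
        uint32_t found = 0; /* 0 is a safe sentinel: learned codes are >= FIRST_NEW */

        /* Linear search for the string (w, in[i]); fine for a tiny demo. */
        for (size_t k = 0; k < used; k++) {
            if (table[k].prefix == w && table[k].symbol == in[i]) {
                found = FIRST_NEW + (uint32_t)k;
                break;
            }
        }

        if (found != 0) {
            w = found;              /* extend the current match */
        } else {
            out[count++] = w;       /* emit the match, then learn match + symbol */
            if (used < MAX_NEW) {
                table[used].prefix = w;
                table[used].symbol = in[i];
                used++;
            }
            w = in[i];
        }
    }

    out[count++] = w;               /* flush the final match */
    out[count++] = EOD_CODE;
    return count;
}
```

Under these assumptions, the sequence 32767 32768 32767 32768 encodes to the four codes 32767, 32768, 65537, 65536. Because the codes go past 65535, each needs at least 17 bits, so four codes take at least 68 bits, which would explain three 4-byte ints rather than two.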
