I am trying to do this problem:
Assume we have an initial alphabet of the entire Unicode character set, instead of just all the possible byte values. Recall that Unicode characters are unsigned 2-byte values, so this means that each 2 bytes of uncompressed data will be treated as one symbol, and we'll have an alphabet with over 60,000 symbols. (Treating symbols as 2-byte Unicodes, rather than a byte at a time, makes for better compression in the case of internationalized text.) And, note, there's nothing that limits the number of bits per code to at most 16. As you generalize the LZW algorithm for this very large alphabet, don't worry if you have some pretty long codes.
With this, give the compressed version of this four-symbol sequence, using our project assumptions, including an EOD code, and grouping into 4-byte ints. (These four symbols are Unicode values, represented numerically.) Write your answer as three 8-digit hex values, space separated, using capital hex digits, not lowercase.
32767 32768 32767 32768
The problem I am having is that I don't know the entire range of the alphabet, so when doing LZW compression I don't know what code values the new dictionary entries will have. For the same reason, I also don't know what the EOD code will be.
Also, it seems to me that the compressed data will only take two integers.
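
To make the question concrete, here is a minimal Python sketch of the generalized algorithm. It rests on assumptions that the problem statement does not confirm: the alphabet is taken to be all 65536 two-byte values (codes 0 to 65535), the EOD code is assumed to be 65536, and new dictionary entries are assumed to start at 65537. The function names `lzw_codes` and `ints_needed` are just illustrative. If the project assumptions differ, the code values change accordingly.

```python
def lzw_codes(symbols, alphabet_size=65536):
    """Return the LZW output codes for a sequence of integer symbols, with EOD appended."""
    eod = alphabet_size            # ASSUMPTION: EOD is the first code after the alphabet
    next_code = alphabet_size + 1  # ASSUMPTION: new entries start right after EOD
    table = {}                     # multi-symbol strings only; single symbols encode as themselves
    codes = []
    current = ()                   # longest match found so far, as a tuple of symbols

    def code_for(seq):
        return seq[0] if len(seq) == 1 else table[seq]

    for s in symbols:
        candidate = current + (s,)
        if len(candidate) == 1 or candidate in table:
            current = candidate            # keep extending the match
        else:
            codes.append(code_for(current))
            table[candidate] = next_code   # learn the new string
            next_code += 1
            current = (s,)
    if current:
        codes.append(code_for(current))    # flush the final match
    codes.append(eod)
    return codes


def ints_needed(codes, bits_per_code=17, int_bits=32):
    """Number of int_bits-wide ints needed to hold the codes at a fixed width."""
    total_bits = len(codes) * bits_per_code
    return -(-total_bits // int_bits)      # ceiling division


codes = lzw_codes([32767, 32768, 32767, 32768])
print(codes)               # [32767, 32768, 65537, 65536]
print(ints_needed(codes))  # 3
```

Under these assumptions the output is four codes, and any code at or above 65536 needs at least 17 bits, so the four codes take at least 68 bits. That does not fit in two 32-bit ints, which would be consistent with the three hex values the problem asks for. The exact hex values still depend on the project's bit-packing order and padding, which this sketch does not try to guess.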