java - Huffman coding - handling Unicode

Question

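I have implemented Huffman coding in Java, and it works on the byte data of an input file. However, it only works when compressing ASCII. I would like to extend it so that it can handle characters that are longer than 1 byte, but I'm not sure exactly how to do that.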

private static final int CHARS = 256;     
private int [] getByteFrequency(File f) throws FileNotFoundException {
    try {
        FileInputStream fis = new FileInputStream(f);
        byte [] bb = new byte[(int) f.length()];
        int [] aa = new int[CHARS];
            if(fis.read(bb) == bb.length) {
                System.out.print("Uncompressed data: ");
                for(int i = 0; i < bb.length; i++) {
                        System.out.print((char) bb[i]);
                        aa[bb[i]]++;
                }
                System.out.println();
            }
        return aa;
    } catch (FileNotFoundException e) { throw new FileNotFoundException(); 
    } catch (IOException e) { e.printStackTrace(); }
    return null;
}

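For example, this is the method I use to get the character frequencies in a file, and it obviously only works for single bytes. If I give it a Unicode file, I get an ArrayIndexOutOfBoundsException at aa[bb[i]]++;, and the reported index is usually a negative number. I know this is because aa[bb[i]]++; only looks at one byte, while a Unicode character can take more than one, but I'm not sure how to change it.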

Can anyone give me some pointers?

score 0 · Accepted Answer

Try the following:

private static final int CHARS = 256;
private int[] getByteFrequency(File f) throws FileNotFoundException {
    // try-with-resources closes the stream even if reading fails
    try (FileInputStream fis = new FileInputStream(f)) {
        byte[] bb = new byte[(int) f.length()];
        int[] aa = new int[CHARS];
        if (fis.read(bb) == bb.length) {
            System.out.print("Uncompressed data: ");
            for (int i = 0; i < bb.length; i++) {
                System.out.print((char) bb[i]);
                // byte is signed (-128..127); masking with 0xff maps it to 0..255
                aa[bb[i] & 0xff]++;
            }
            System.out.println();
        }
        return aa;
    } catch (FileNotFoundException e) {
        throw e; // rethrow the original so its message is preserved
    } catch (IOException e) {
        e.printStackTrace();
    }
    return null;
}

If I'm right (I haven't tested this), your problem is that byte is a SIGNED value in Java. Converting it to an int and masking it with 0xff should handle it correctly.
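To see why the mask matters, here is a minimal, self-contained sketch. The class name ByteFrequencyDemo and the helper byteFrequency are made up for illustration and are not part of the original code; it also assumes the input is UTF-8 text. It shows that a non-ASCII character turns into bytes that are negative when read as a Java byte, and that masking with 0xff gives a valid index in 0..255, so counting per byte keeps working on multi-byte characters.

```java
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not from the original post.
class ByteFrequencyDemo {

    // Counts how often each byte value (0..255) occurs in the data.
    static int[] byteFrequency(byte[] data) {
        int[] freq = new int[256];
        for (byte b : data) {
            // b is in -128..127; b & 0xff promotes it to int and keeps only the low 8 bits.
            freq[b & 0xff]++;
        }
        return freq;
    }

    public static void main(String[] args) {
        // U+00E9 (e with acute accent) is encoded in UTF-8 as the two bytes 0xC3 0xA9.
        byte[] data = "\u00e9".getBytes(StandardCharsets.UTF_8);

        System.out.println(data[0]);        // signed value, negative for bytes >= 0x80
        System.out.println(data[0] & 0xff); // unsigned value, safe as an array index

        int[] freq = byteFrequency(data);
        System.out.println(freq[0xC3] + " " + freq[0xA9]);
    }
}
```

Since Java 8, Byte.toUnsignedInt(b) does the same conversion as b & 0xff, if you prefer a named method.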
