java - Why does the US-ASCII encoding accept non-US-ASCII characters?

Question

Consider the following code:

import java.io.ByteArrayInputStream;
import java.io.InputStreamReader;

public class ReadingTest {

    public void readAndPrint(String usingEncoding) throws Exception {
        // The two bytes 0xC2 0xB5 are the UTF-8 representation of the 'micro' sign (U+00B5)
        ByteArrayInputStream bais = new ByteArrayInputStream(new byte[]{(byte) 0xC2, (byte) 0xB5});
        InputStreamReader isr = new InputStreamReader(bais, usingEncoding);
        char[] cbuf = new char[2];
        isr.read(cbuf);
        // Print the first decoded character and its numeric code point
        System.out.println(cbuf[0] + " " + (int) cbuf[0]);
    }

    public static void main(String[] argv) throws Exception {
        ReadingTest w = new ReadingTest();
        w.readAndPrint("UTF-8");
        w.readAndPrint("US-ASCII");
    }
}

Observed output:

µ 181
? 65533

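Why does the second call to readAndPrint() (the one using US-ASCII) succeed? I expected it to throw an error, since the input is not a valid character in this encoding. Where in the Java API or the JLS is this behavior mandated?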

score 9 · Accepted Answer

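The default action when undecodable bytes are found in the input stream is to replace them with the Unicode character U+FFFD REPLACEMENT CHARACTER. That is why the second line of the output shows code point 65533 (0xFFFD) instead of an exception.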

If you want to change that, you can pass the InputStreamReader a CharsetDecoder that has a different CodingErrorAction configured:

CharsetDecoder decoder = Charset.forName(usingEncoding).newDecoder();
decoder.onMalformedInput(CodingErrorAction.REPORT);
InputStreamReader isr = new InputStreamReader(bais, decoder);
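
For completeness, here is a minimal self-contained sketch of the same idea applied to the bytes from the question. The class name StrictReadingDemo and the helper readFirstCharStrict are made up for illustration; setting onUnmappableCharacter as well is an extra precaution, not something the snippet above requires.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

public class StrictReadingDemo {

    // Hypothetical helper: decodes the first char of 'input' with a decoder
    // that reports bad input instead of silently substituting U+FFFD.
    public static int readFirstCharStrict(byte[] input, String encoding) throws IOException {
        CharsetDecoder decoder = Charset.forName(encoding).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        try (InputStreamReader isr =
                     new InputStreamReader(new ByteArrayInputStream(input), decoder)) {
            return isr.read(); // the decoded char as an int, or -1 at end of stream
        }
    }

    public static void main(String[] args) throws IOException {
        byte[] micro = {(byte) 0xC2, (byte) 0xB5}; // UTF-8 bytes of the micro sign

        // Valid UTF-8, so this decodes normally.
        System.out.println(readFirstCharStrict(micro, "UTF-8")); // prints 181

        // 0xC2 is not a valid US-ASCII byte, so the strict decoder reports it.
        try {
            readFirstCharStrict(micro, "US-ASCII");
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```

The exception raised here is a subclass of CharacterCodingException, which itself extends IOException, so existing code that already handles IOException from the reader will see it without further changes.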

score 3 · Accepted Answer

I'd say this is the same as for the constructor String(byte bytes[], int offset, int length, Charset charset):

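This method always replaces malformed-input and unmappable-character sequences with this charset's default replacement string. The java.nio.charset.CharsetDecoder class should be used when more control over the decoding process is required.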

With a CharsetDecoder you can specify a different CodingErrorAction.
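
As a rough illustration of that difference, the sketch below decodes the same two bytes once through the String constructor and once through a CharsetDecoder set to REPORT. The class and method names (StringDecodeDemo, decodeLenient, decodeStrict) are invented for this example.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;

public class StringDecodeDemo {

    // The String constructor always substitutes the replacement string for bad input.
    public static String decodeLenient(byte[] bytes) {
        return new String(bytes, 0, bytes.length, StandardCharsets.US_ASCII);
    }

    // A CharsetDecoder configured with REPORT throws instead of substituting.
    public static String decodeStrict(byte[] bytes) throws CharacterCodingException {
        CharsetDecoder decoder = StandardCharsets.US_ASCII.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        return decoder.decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] micro = {(byte) 0xC2, (byte) 0xB5}; // not valid US-ASCII

        String lenient = decodeLenient(micro);
        // Each invalid byte becomes one U+FFFD, so the first code point is 65533.
        System.out.println((int) lenient.charAt(0)); // prints 65533

        try {
            decodeStrict(micro);
        } catch (CharacterCodingException e) {
            System.out.println("rejected: " + e);
        }
    }
}
```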
