I am writing a Java (7 SE) app to ingest all sorts of text-based inputs, and am concerned about running into character sets/encodings that the JRE doesn't support (for instance this app will run on a Linux box but will be ingesting files generated on every major OS, etc.).
For one, is there a way to catch an IOException (or similar) if the InputStreamReader encounters an unsupported charset/encoding?
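For context, here is a minimal sketch of what I mean by "catching" an unsupported charset. My understanding is that the lookup itself can fail with an unchecked UnsupportedCharsetException (or IllegalCharsetNameException), before any reading happens; the class name CharsetProbe and the lookup helper are just illustrative:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetProbe {
    /** Returns the named charset, or null if this JRE doesn't support it
        (or the name itself is syntactically illegal). */
    static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (UnsupportedCharsetException | IllegalCharsetNameException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // UTF-8 is one of the charsets every JRE is required to support.
        System.out.println(lookup("UTF-8") != null);
        // A made-up name: syntactically legal, but no JRE will have it.
        System.out.println(lookup("X-NO-SUCH-CS") != null);
    }
}
```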
And what about inputs that contain multiple encodings? Say we have 4 different types of inputs:
- Raw java.lang.String objects
- Plaintext (.txt) files
- Word (.docx) files
- PDF files
What if we're reading one of these inputs and we start encountering multiple (but supported) character encodings? Does the JRE natively handle this, or do I need multiple readers, each configured with its own charset/encoding?
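To illustrate the failure mode I'm worried about: as far as I can tell, a single InputStreamReader decodes with exactly one charset, and by default it silently replaces bytes that don't fit that charset rather than raising an error. A sketch of how one might force it to fail loudly instead, via a CharsetDecoder configured with CodingErrorAction.REPORT (the class and method names here are mine, not from any library):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    /** Decodes bytes with the given charset, throwing MalformedInputException
        on byte sequences invalid in that encoding, instead of the default
        behavior of substituting the replacement character U+FFFD. */
    static String decodeStrict(byte[] bytes, Charset cs) throws IOException {
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), dec);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) sb.append(buf, 0, n);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        // Decoding with the right charset works.
        System.out.println(decodeStrict(utf8, StandardCharsets.UTF_8));
        // Decoding the same bytes as US-ASCII now fails instead of
        // quietly emitting replacement characters.
        try {
            decodeStrict(utf8, StandardCharsets.US_ASCII);
            System.out.println("no error");
        } catch (MalformedInputException e) {
            System.out.println("malformed for US-ASCII");
        }
    }
}
```

This still doesn't make one reader handle two encodings; it just turns a wrong-charset guess into a detectable exception rather than corrupted text.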
In such a case, could I "normalize" the streaming inputs to a single, standardized charset/encoding (most likely UTF-8)? Thanks in advance.
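By "normalize" I mean something like the following sketch: decode each input with whatever charset it was written in, and re-encode the characters as UTF-8, so everything downstream sees one encoding. The Transcode/toUtf8 names are hypothetical, and this assumes the source charset is known (or guessed) per input:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

public class Transcode {
    /** Copies text from 'in' (decoded with 'sourceCs') to 'out', re-encoded as UTF-8. */
    static void toUtf8(InputStream in, Charset sourceCs, OutputStream out) throws IOException {
        Reader r = new InputStreamReader(in, sourceCs);
        Writer w = new OutputStreamWriter(out, Charset.forName("UTF-8"));
        char[] buf = new char[4096];
        int n;
        while ((n = r.read(buf)) != -1) w.write(buf, 0, n);
        w.flush();
    }

    public static void main(String[] args) throws IOException {
        // 'é' is one byte (0xE9) in ISO-8859-1 but two bytes in UTF-8.
        byte[] latin1 = "café".getBytes("ISO-8859-1");
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        toUtf8(new ByteArrayInputStream(latin1), Charset.forName("ISO-8859-1"), utf8);
        System.out.println(utf8.size());                       // 4 bytes in, 5 bytes out
        System.out.println(new String(utf8.toByteArray(), "UTF-8"));
    }
}
```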