I am writing a Java SE 7 app to ingest all sorts of text-based inputs, and am concerned about running into character sets/encodings that the JRE doesn't support (for instance, this app will run on a Linux box but will be ingesting files generated on every major OS).

For one, is there a way to catch an IOException (or similar) if the InputStreamReader encounters an unsupported charset/encoding?

And what about inputs that contain multiple encodings? Say we have 4 different types of inputs:

  • Raw java.lang.Strings
  • Plaintext (.txt) files
  • Word (.docx) files
  • PDF files

What if we're reading one of these inputs and we start encountering multiple (but supported) character encodings? Does the JRE natively handle this, or do I have to have multiple readers, each configured with its own charset/encoding?

In such a case, could I "normalize" the streaming inputs to a single, standardized (UTF-8 most likely) set/encoding? Thanks in advance.

1 Answer

To answer your first question, you can create a CharsetDecoder and specify what should happen when it encounters malformed input.

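// myCustomErrorAction is a java.nio.charset.CodingErrorAction: REPORT, REPLACE, or IGNORE.
CharsetDecoder charsetDecoder = Charset.forName("UTF-8").newDecoder();
charsetDecoder.onMalformedInput(myCustomErrorAction);
charsetDecoder.onUnmappableCharacter(myCustomErrorAction);
// The reader applies the chosen action whenever it hits bytes it cannot decode.
Reader inputReader = new InputStreamReader(inputStream, charsetDecoder);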
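
To make the "catch an IOException" part concrete, here is a minimal sketch assuming you pick CodingErrorAction.REPORT as the error action. With REPORT, the reader throws a CharacterCodingException (a subclass of IOException) on the first bad byte sequence. The class and method names (StrictDecodingSketch, readStrict) are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;

// Hypothetical helper class, not part of the JDK.
class StrictDecodingSketch {

    // Reads the whole stream as text and fails on bytes that are invalid for the charset.
    static String readStrict(InputStream in, String charsetName) throws IOException {
        CharsetDecoder decoder = Charset.forName(charsetName).newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        StringBuilder text = new StringBuilder();
        try (Reader reader = new InputStreamReader(in, decoder)) {
            char[] buffer = new char[4096];
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // 0xFF never occurs in well-formed UTF-8, so this input must be rejected.
        byte[] invalidUtf8 = {'o', 'k', (byte) 0xFF};
        try {
            readStrict(new ByteArrayInputStream(invalidUtf8), "UTF-8");
        } catch (CharacterCodingException e) {
            // MalformedInputException and UnmappableCharacterException both extend this type.
            System.out.println("Rejected input: " + e);
        }
    }
}
```

If you would rather keep going on bad input, CodingErrorAction.REPLACE substitutes a replacement character instead of throwing.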

As for catching the case where the charset is not supported at all, it looks like:

if (Charset.isSupported(encodingSpecified)) {
    // Normal case
} else {
    // Error case
}
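
One thing to watch: Charset.isSupported itself throws IllegalCharsetNameException when the name is syntactically illegal (and IllegalArgumentException on null), so a check on untrusted input may want to guard for that too. A rough sketch follows; the helper name charsetOrFallback and the choice to fall back to a default charset are my assumptions, and you may prefer to reject the input instead.

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;

// Hypothetical helper class, not part of the JDK.
class CharsetLookupSketch {

    // Returns the named charset, or the fallback if the name is null, illegal, or unsupported.
    static Charset charsetOrFallback(String name, Charset fallback) {
        if (name == null) {
            return fallback;
        }
        try {
            return Charset.isSupported(name) ? Charset.forName(name) : fallback;
        } catch (IllegalCharsetNameException e) {
            // The name contains characters that are not allowed in charset names.
            return fallback;
        }
    }

    public static void main(String[] args) {
        Charset utf8 = Charset.forName("UTF-8");
        System.out.println(charsetOrFallback("ISO-8859-1", utf8));
        System.out.println(charsetOrFallback("no-such-charset", utf8));
        System.out.println(charsetOrFallback("not a legal name!", utf8));
    }
}
```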

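However, I'm not sure about multiple encodings. I think it would be very unusual for a single binary stream to contain more than one encoding. The stream would need some custom way to signal a change of encoding. You would have to read from the stream one character at a time looking for that indicator, and if you encounter it, create a new reader on the same stream using the new encoding.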

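In all cases, once you go from a byte stream to a character stream in Java, the characters are held in memory as Java chars, independent of the encoding they were read from, so there is no need to normalize unless you are saving the data back somewhere. If you are later going to save that data back to a file, I would strongly recommend picking one encoding and sticking with it for all the data you store.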
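
If you do want to write everything back out as UTF-8, a sketch along these lines should work, assuming you already know (or have guessed) the source charset of each input. The names TranscodeSketch and transcodeToUtf8 are made up for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical helper class, not part of the JDK.
class TranscodeSketch {

    // Decodes bytes using sourceCharset and re-encodes the resulting characters as UTF-8.
    // The caller remains responsible for closing both streams.
    static void transcodeToUtf8(InputStream in, Charset sourceCharset, OutputStream out)
            throws IOException {
        Reader reader = new InputStreamReader(in, sourceCharset);
        Writer writer = new OutputStreamWriter(out, StandardCharsets.UTF_8);
        char[] buffer = new char[4096];
        int count;
        while ((count = reader.read(buffer)) != -1) {
            writer.write(buffer, 0, count);
        }
        writer.flush(); // push any buffered characters through to the output stream
    }

    public static void main(String[] args) throws IOException {
        byte[] latin1 = "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        transcodeToUtf8(new ByteArrayInputStream(latin1), StandardCharsets.ISO_8859_1, out);
        // The accented character takes one byte in ISO-8859-1 and two in UTF-8.
        System.out.println(latin1.length + " bytes in, " + out.size() + " bytes out");
    }
}
```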

Answered 2013-02-26T14:12:05.550