I am writing a Java (7 SE) app to ingest all sorts of text-based inputs, and am concerned about running into character sets/encodings that the JRE doesn't support (for instance this app will run on a Linux box but will be ingesting files generated on every major OS, etc.).
For one, is there a way to catch an IOException (or similar) if the InputStreamReader encounters an unsupported charset/encoding?
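For context, here is a minimal sketch of what I mean by "catching" an unsupported charset. My understanding is that the lookup itself can fail with an unchecked UnsupportedCharsetException (or IllegalCharsetNameException), before any reading happens; the class name CharsetProbe and the lookup helper are just illustrative:

```java
import java.nio.charset.Charset;
import java.nio.charset.IllegalCharsetNameException;
import java.nio.charset.UnsupportedCharsetException;

public class CharsetProbe {
    /** Returns the named charset, or null if this JRE doesn't support it
        (or the name itself is syntactically illegal). */
    static Charset lookup(String name) {
        try {
            return Charset.forName(name);
        } catch (UnsupportedCharsetException | IllegalCharsetNameException e) {
            return null;
        }
    }

    public static void main(String[] args) {
        // UTF-8 is one of the charsets every JRE is required to support.
        System.out.println(lookup("UTF-8") != null);
        // A made-up name: syntactically legal, but no JRE will have it.
        System.out.println(lookup("X-NO-SUCH-CS") != null);
    }
}
```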
And what about inputs that contain multiple encodings? Say we have 4 different types of inputs:
- Raw java.lang.String objects
- Plaintext (.txt) files
- Word (.docx) files
- PDF files
What if we're reading one of these inputs and we start encountering multiple (but supported) character encodings? Does the JRE natively handle this, or do I need multiple readers, each configured with its own charset/encoding?
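To illustrate the failure mode I'm worried about: as far as I can tell, a single InputStreamReader decodes with exactly one charset, and by default it silently replaces bytes that don't fit that charset rather than raising an error. A sketch of how one might force it to fail loudly instead, via a CharsetDecoder configured with CodingErrorAction.REPORT (the class and method names here are mine, not from any library):

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.MalformedInputException;
import java.nio.charset.StandardCharsets;

public class StrictDecode {
    /** Decodes bytes with the given charset, throwing MalformedInputException
        on byte sequences invalid in that encoding, instead of the default
        behavior of substituting the replacement character U+FFFD. */
    static String decodeStrict(byte[] bytes, Charset cs) throws IOException {
        CharsetDecoder dec = cs.newDecoder()
                .onMalformedInput(CodingErrorAction.REPORT)
                .onUnmappableCharacter(CodingErrorAction.REPORT);
        Reader r = new InputStreamReader(new ByteArrayInputStream(bytes), dec);
        StringBuilder sb = new StringBuilder();
        char[] buf = new char[1024];
        int n;
        while ((n = r.read(buf)) != -1) sb.append(buf, 0, n);
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        byte[] utf8 = "héllo".getBytes(StandardCharsets.UTF_8);
        // Decoding with the right charset works.
        System.out.println(decodeStrict(utf8, StandardCharsets.UTF_8));
        // Decoding the same bytes as US-ASCII now fails instead of
        // quietly emitting replacement characters.
        try {
            decodeStrict(utf8, StandardCharsets.US_ASCII);
            System.out.println("no error");
        } catch (MalformedInputException e) {
            System.out.println("malformed for US-ASCII");
        }
    }
}
```

This still doesn't make one reader handle two encodings; it just turns a wrong-charset guess into a detectable exception rather than corrupted text.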
In such a case, could I "normalize" the streaming inputs to a single, standardized charset/encoding (most likely UTF-8)? Thanks in advance.
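By "normalize" I mean something like the following sketch: decode each input with whatever charset it was written in, and re-encode the characters as UTF-8, so everything downstream sees one encoding. The Transcode/toUtf8 names are hypothetical, and this assumes the source charset is known (or guessed) per input:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.OutputStreamWriter;
import java.io.Reader;
import java.io.Writer;
import java.nio.charset.Charset;

public class Transcode {
    /** Copies text from 'in' (decoded with 'sourceCs') to 'out', re-encoded as UTF-8. */
    static void toUtf8(InputStream in, Charset sourceCs, OutputStream out) throws IOException {
        Reader r = new InputStreamReader(in, sourceCs);
        Writer w = new OutputStreamWriter(out, Charset.forName("UTF-8"));
        char[] buf = new char[4096];
        int n;
        while ((n = r.read(buf)) != -1) w.write(buf, 0, n);
        w.flush();
    }

    public static void main(String[] args) throws IOException {
        // 'é' is one byte (0xE9) in ISO-8859-1 but two bytes in UTF-8.
        byte[] latin1 = "café".getBytes("ISO-8859-1");
        ByteArrayOutputStream utf8 = new ByteArrayOutputStream();
        toUtf8(new ByteArrayInputStream(latin1), Charset.forName("ISO-8859-1"), utf8);
        System.out.println(utf8.size());                       // 4 bytes in, 5 bytes out
        System.out.println(new String(utf8.toByteArray(), "UTF-8"));
    }
}
```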