I have a Java application that parses an XML file encoded in UTF-16LE. Parsing has been failing because the file contains illegal XML characters. My workaround is to read the file into a Java String, remove the illegal characters, and then parse the result. This works 99% of the time, but there are slight differences between the input and the output that are not caused by removing the illegal characters. I think they come from converting the UTF-16LE file to Java's internal UTF-16 String.

BufferedReader reader = null;
String fileText = ""; //stored as UTF-16
try {
    reader = new BufferedReader(new InputStreamReader(in, "UTF-16LE"));
    for (String line; (line = reader.readLine()) != null; ) {
        fileText += line;
    }
} catch (Exception ex) {
    logger.log(Level.WARNING, "Error removing illegal xml characters", ex);
} finally {
    if (reader != null) {
        reader.close();
    }
}

// code to remove illegal chars from string here, irrelevant to problem

ByteArrayInputStream inStream = new ByteArrayInputStream(fileText.getBytes("UTF-16LE"));
Document doc = XmlUtil.openDocument(inStream, XML_ROOT_NODE_ELEM);

Do characters get changed or lost when going from UTF-16LE to UTF-16? Is there a way to do this in Java that guarantees the output is exactly the same as the input?

1 Answer

One definite problem is that readLine discards the line terminator.

You would need to do something like:

       fileText += line + "\r\n";

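Otherwise XML attributes, DTD entities, or other tokens could get glued together where at least one whitespace character was required. You also do not want text content to change when it contains a line break. Note that readLine does not report which terminator it dropped, so the appended terminator may differ from the original one.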
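
If the goal is for the text to match the input exactly, one option is to skip readLine altogether and copy characters through a buffer, so no line terminator is ever dropped or guessed. The following is only a minimal sketch: the class name Utf16LeReader and the method readAll are hypothetical, not part of the original code. It assumes the input is well-formed UTF-16LE; as far as I know, InputStreamReader replaces malformed input (such as an unpaired surrogate) with U+FFFD, so such bytes would not round-trip.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

// Hypothetical helper class, not part of the original code.
final class Utf16LeReader {

    // Decodes the whole stream as UTF-16LE and keeps every character,
    // including "\r", "\n" and "\r\n" terminators, exactly as read.
    static String readAll(InputStream in) throws IOException {
        StringBuilder text = new StringBuilder();
        char[] buffer = new char[8192];
        try (Reader reader = new InputStreamReader(in, StandardCharsets.UTF_16LE)) {
            int count;
            while ((count = reader.read(buffer)) != -1) {
                text.append(buffer, 0, count);
            }
        }
        return text.toString();
    }

    public static void main(String[] args) throws IOException {
        // Sample input with mixed line terminators.
        byte[] original = "<a\r\n b=\"1\"/>\n".getBytes(StandardCharsets.UTF_16LE);
        String text = readAll(new ByteArrayInputStream(original));
        byte[] roundTrip = text.getBytes(StandardCharsets.UTF_16LE);
        // Compare the re-encoded bytes with the original bytes.
        System.out.println(Arrays.equals(original, roundTrip));
    }
}
```

Comparing the re-encoded bytes with the original bytes, as main does here, is a simple way to check whether anything other than the removed illegal characters has changed.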

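Performance (speed and memory) can be improved by using a StringBuilder instead of repeated String concatenation:

StringBuilder fileText = new StringBuilder();
// inside the read loop:
fileText.append(line).append("\n");
// after the loop:
String text = fileText.toString();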

There may also be a problem with the first character of the file, where a BOM character (U+FEFF) is sometimes redundantly added:

line = line.replace("\uFEFF", "");
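
The replace call above removes every U+FEFF in the line, not just a leading one. If only the BOM at the very start of the file should go, a narrower check may be preferable. This is a sketch under that assumption; BomStripper and stripLeadingBom are hypothetical names. It relies on the Java "UTF-16LE" charset passing a leading BOM through as an ordinary character when decoding, which is the documented behaviour as far as I am aware.

```java
// Hypothetical helper class, not part of the original code.
final class BomStripper {

    // U+FEFF, the byte order mark (also ZERO WIDTH NO-BREAK SPACE).
    private static final char BOM = '\uFEFF';

    // Removes a single BOM at the start of the text and leaves
    // any later U+FEFF characters untouched.
    static String stripLeadingBom(String text) {
        if (!text.isEmpty() && text.charAt(0) == BOM) {
            return text.substring(1);
        }
        return text;
    }

    public static void main(String[] args) {
        String withBom = BOM + "<root/>";
        // The stripped text is one character shorter than the input.
        System.out.println(stripLeadingBom(withBom).length() == withBom.length() - 1);
    }
}
```

Applied once to the complete file text rather than to each line, this keeps any U+FEFF that appears inside the document content.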
Answered 2017-11-16T14:41:01.487