java - transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8") is NOT working

Question

I have the following method to write an XML DOM to a stream:

public void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
    fDoc.setXmlStandalone(true);
    DOMSource docSource = new DOMSource(fDoc);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    transformer.transform(docSource, new StreamResult(out));
}

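I am testing some other XML functionality, and this is just the method I use to write to a file. My test program generates 33 test cases in which files are written out. 28 of them have the following header: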

<?xml version="1.0" encoding="UTF-8"?>...

But for some reason, 1 of the test cases now produces:

<?xml version="1.0" encoding="ISO-8859-1"?>...

And four more produce:

<?xml version="1.0" encoding="Windows-1252"?>...

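As you can see, I am setting the ENCODING output key to UTF-8. These tests used to work on an earlier version of Java. I have not run the tests in a while (more than a year), but running them today on "Java(TM) SE Runtime Environment (build 1.6.0_22-b04)" I get this funny behavior.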

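I have verified that the documents causing the problem were read from files that originally had those encodings. It seems that the new versions of the libraries are attempting to preserve the encoding of the source file that was read. But that is not what I want ... I really do want the output to be in UTF-8.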
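
One way to confirm that the parsed document carries the source encoding is to inspect `Document.getXmlEncoding()` right after parsing. The sketch below is only an illustration: the class name `XmlEncodingProbe` and the inline sample document are made up for the example, and it uses whatever default parser the JDK provides.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical helper for illustration; not part of the original test program.
class XmlEncodingProbe {

    // Parses the bytes and returns the encoding recorded from the XML declaration.
    // Returns null when the declaration does not name an encoding.
    static String declaredEncoding(byte[] xmlBytes) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new ByteArrayInputStream(xmlBytes));
        return doc.getXmlEncoding();
    }

    public static void main(String[] args) throws Exception {
        // Sample input standing in for one of the problematic test files.
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        System.out.println("Declared encoding: " + declaredEncoding(latin1));
    }
}
```

If this reports the original file's encoding for the failing cases and null for the passing ones, that would be consistent with the serializer picking the encoding up from the document rather than from the output property.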

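Does anyone know of any other factors that might cause the transformer to ignore the UTF-8 encoding setting? Is there anything else that has to be set on the document to make it forget the encoding of the file that was originally read?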

UPDATE:

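I checked the same project out on another machine, built it, and ran the tests there. On that machine all the tests pass! All the files have "UTF-8" in their header. That machine has "Java(TM) SE Runtime Environment (build 1.6.0_29-b11)". Both machines are running Windows 7. On the new machine, where it works correctly, jdk1.5.0_11 is used for the build, but on the old machine jdk1.6.0_26 is used for the build. The libraries used for both builds are exactly the same. Could it be an incompatibility between JDK 1.6 and 1.5 at build time?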

UPDATE:

After 4.5 years the Java library is still broken, but thanks to the suggestion by Vyrx below, I finally have a proper solution!

public void writeToOutputStream(Document fDoc, OutputStream out) throws Exception {
    fDoc.setXmlStandalone(true);
    DOMSource docSource = new DOMSource(fDoc);
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.METHOD, "xml");
    transformer.setOutputProperty(OutputKeys.ENCODING, "UTF-8");
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    transformer.setOutputProperty(OutputKeys.INDENT, "no");
    out.write("<?xml version=\"1.0\" encoding=\"UTF-8\"?>".getBytes("UTF-8"));
    transformer.transform(docSource, new StreamResult(out));
}

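The solution is to disable writing of the header, and to write the correct header yourself just before serializing the XML to the output stream. Lame, but it produces the correct results. Tests that broke over 4 years ago are now running again!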

score 2 · Accepted Answer

To answer the question: the following code works for me. It takes data in the input encoding and converts it to the output encoding.

        ByteArrayInputStream inStreamXMLElement = new ByteArrayInputStream(strXMLElement.getBytes(input_encoding));
        DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
        DocumentBuilder db = dbf.newDocumentBuilder(); 
        Document docRepeat = db.parse(new InputSource(new InputStreamReader(inStreamXMLElement, input_encoding)));
        Node elementNode = docRepeat.getElementsByTagName(strRepeat).item(0);

        TransformerFactory tFactory = null;
        Transformer transformer = null;
        DOMSource domSourceRepeat = new DOMSource(elementNode);
        tFactory = TransformerFactory.newInstance();
        transformer = tFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, output_encoding);

        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        StreamResult sr = new StreamResult(new OutputStreamWriter(bos, output_encoding));


        transformer.transform(domSourceRepeat, sr);
        byte[] outputBytes = bos.toByteArray();
        strRepeatString = new String(outputBytes, output_encoding);

score 2 · Accepted Answer

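I had the same problem on Android when serializing emoji characters. When using UTF-8 encoding in the transformer, the output was HTML character entities (UTF-16 surrogate pairs), which would subsequently break other parsers that read the data.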

This is how I ended up solving it:

StringWriter sw = new StringWriter();
sw.write("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>");
Transformer t = TransformerFactory.newInstance().newTransformer();

// this will work because we are creating a Java string, not writing to an output
t.setOutputProperty(OutputKeys.ENCODING, "UTF-16"); 
t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
t.transform(new DOMSource(elementNode), new StreamResult(sw));

return IOUtils.toInputStream(sw.toString(), Charset.forName("UTF-8"));
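
For readers without Apache Commons IO on the classpath, the same idea can be sketched with the JDK alone. This is a minimal variant under assumptions, not the exact code above: the class and method names (`StringWriterSerializer`, `toXmlString`, `toUtf8Stream`) are invented for the example, and `IOUtils.toInputStream` is replaced by a `ByteArrayInputStream` over the UTF-8 bytes of the string.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;
import java.io.StringReader;
import java.io.StringWriter;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

// Hypothetical helper names; only standard JDK classes are used.
class StringWriterSerializer {

    // Serializes the node into a Java String, with a hand-written UTF-8 declaration in front.
    static String toXmlString(Node node) throws Exception {
        StringWriter sw = new StringWriter();
        sw.write("<?xml version=\"1.0\" encoding=\"UTF-8\" ?>");
        Transformer t = TransformerFactory.newInstance().newTransformer();
        // The target is a Java String (UTF-16 internally), so no byte encoding happens here.
        t.setOutputProperty(OutputKeys.ENCODING, "UTF-16");
        t.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
        t.transform(new DOMSource(node), new StreamResult(sw));
        return sw.toString();
    }

    // Encodes the finished string as UTF-8 bytes, so the bytes match the declaration.
    static InputStream toUtf8Stream(Node node) throws Exception {
        return new ByteArrayInputStream(toXmlString(node).getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new InputSource(new StringReader("<root>hi</root>")));
        System.out.println(toXmlString(doc.getDocumentElement()));
    }
}
```

The design point is the same as in the answer: the transformer only ever produces characters, and the single place where characters become bytes is the explicit `getBytes(StandardCharsets.UTF_8)` call.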

score 1 · Accepted Answer

I spent a significant amount of time debugging this issue because it worked well on my machine (Ubuntu 14 + Java 1.8.0_45) but did not work properly in production (Alpine Linux + Java 1.7).

Contrary to my expectation, the hint from the answer above did not help me:

ByteArrayOutputStream bos = new ByteArrayOutputStream();
StreamResult sr = new StreamResult(new OutputStreamWriter(bos, "UTF-8"));

But this one worked as expected:

val out = new StringWriter()
val result = new StreamResult(out)

score 1 · Accepted Answer

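I was able to work around the problem by wrapping the Document object passed to the DOMSource constructor. The getXmlEncoding method of my wrapper always returns null; all other methods are delegated to the wrapped Document object.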
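
The wrapper itself is not shown in this answer. One possible way to build such a wrapper without hand-writing every `Document` method is a `java.lang.reflect.Proxy`; the sketch below takes that route as an assumption, and the class name `EncodingHidingDocument` is invented. Whether a given transformer accepts a proxy depends on the implementation: one that casts the document to its own internal class would reject it.

```java
import java.io.ByteArrayInputStream;
import java.lang.reflect.InvocationTargetException;
import java.lang.reflect.Proxy;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;

// Hypothetical wrapper factory; a sketch, not the answerer's actual class.
class EncodingHidingDocument {

    // Returns a Document view whose getXmlEncoding() is always null;
    // every other call is forwarded to the wrapped document.
    static Document wrap(Document target) {
        return (Document) Proxy.newProxyInstance(
                EncodingHidingDocument.class.getClassLoader(),
                new Class<?>[] { Document.class },
                (proxy, method, args) -> {
                    if ("getXmlEncoding".equals(method.getName())) {
                        return null; // hide the encoding remembered from the source file
                    }
                    try {
                        return method.invoke(target, args);
                    } catch (InvocationTargetException e) {
                        throw e.getCause(); // rethrow the delegate's own exception
                    }
                });
    }

    public static void main(String[] args) throws Exception {
        byte[] latin1 = "<?xml version=\"1.0\" encoding=\"ISO-8859-1\"?><root/>"
                .getBytes(StandardCharsets.ISO_8859_1);
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                .parse(new ByteArrayInputStream(latin1));
        System.out.println("original: " + doc.getXmlEncoding());
        System.out.println("wrapped:  " + wrap(doc).getXmlEncoding());
    }
}
```

The wrapped object would then be passed as `new DOMSource(EncodingHidingDocument.wrap(doc))`.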

score 0 · Accepted Answer

What about this?:

public static String documentToString(Document doc) throws Exception {
    return documentToString(doc, "UTF-8");
}

public static String documentToString(Document doc, String encoding) throws Exception {
    TransformerFactory transformerFactory = TransformerFactory.newInstance();
    Transformer transformer = null;

    // validateNullString is the poster's own helper and is not shown here.
    if ("".equals(validateNullString(encoding))) encoding = "UTF-8";
    try {
        transformer = transformerFactory.newTransformer();
        transformer.setOutputProperty(OutputKeys.INDENT, "yes");
        transformer.setOutputProperty(OutputKeys.ENCODING, encoding);
    } catch (javax.xml.transform.TransformerConfigurationException error) {
        return null;
    }

    Source source = new DOMSource(doc);
    StringWriter writer = new StringWriter();
    Result result = new StreamResult(writer);

    try {
        transformer.transform(source, result);
    } catch (javax.xml.transform.TransformerException error) {
        return null;
    }
    return writer.toString();
} // documentToString

score 0 · Accepted Answer

Using the Saxon TransformerFactoryImpl (Saxon HE >= 10.3):

public void writeToStream(Document doc, OutputStream output) throws TransformerException, IOException
    {
        TransformerFactory transformerFactory =
            TransformerFactory.newInstance("net.sf.saxon.TransformerFactoryImpl", null);
        transformerFactory.setFeature(XMLConstants.FEATURE_SECURE_PROCESSING, true);
        transformerFactory.setAttribute(XMLConstants.ACCESS_EXTERNAL_DTD, "");
        transformerFactory.setAttribute(XMLConstants.ACCESS_EXTERNAL_STYLESHEET, "");
        Transformer transformer = transformerFactory.newTransformer();
        DOMSource source = new DOMSource(doc);
        transformer.transform(source, new StreamResult(output));

    }

This solved the issue on my side.

score -1 · Accepted Answer

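I'm taking a wild shot here, but you mention that you are reading files to get the test data. Can you make sure that you read the files using the proper encoding, so that when you write to your OutputStream you already have the data in the proper encoding?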

So have something like new InputStreamReader(new FileInputStream(fileDir), "UTF8").

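Don't forget that the single-argument constructors of FileReader always use the platform default encoding: "The constructors of this class assume that the default character encoding and the default byte-buffer size are appropriate."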
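
To make the difference concrete, here is a small sketch of reading a file with an explicitly named charset instead of relying on the platform default. The class name `ExplicitCharsetRead` and the temporary file are only for illustration.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.Reader;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper for illustration.
class ExplicitCharsetRead {

    // Reads the whole file, decoding its bytes with the given charset.
    static String readAll(Path file, Charset charset) throws IOException {
        StringBuilder sb = new StringBuilder();
        try (Reader in = new BufferedReader(
                new InputStreamReader(Files.newInputStream(file), charset))) {
            char[] buf = new char[4096];
            int n;
            while ((n = in.read(buf)) != -1) {
                sb.append(buf, 0, n);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) throws IOException {
        // Write a Latin-1 encoded sample, then read it back with the matching charset.
        Path tmp = Files.createTempFile("charset-demo", ".txt");
        Files.write(tmp, "caf\u00e9".getBytes(StandardCharsets.ISO_8859_1));
        System.out.println(readAll(tmp, StandardCharsets.ISO_8859_1));
        Files.delete(tmp);
    }
}
```

Reading the same bytes with the wrong charset silently produces different characters, which is why naming the charset at the read site matters.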

score -1 · Accepted Answer

Try setting the encoding on your StreamResult specifically:

StreamResult result = new StreamResult(new OutputStreamWriter(out, "UTF-8"));

This way, it should only be able to write out in UTF-8.
