I haven't really tested Jsoup yet, but JTidy has been very helpful to me when I needed to convert HTML to XML, using the class org.w3c.tidy.Tidy. It converts entities automatically.
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;
import org.w3c.tidy.Tidy;

public class TidyExample { // class name is illustrative
    static String str1 = "Um grupo ligado à al-Qaeda assumiu o "
            + "ataque e ameaçou fazer outros.";

    public static void main(String[] args) throws Exception {
        System.out.println(cleanData(str1));
    }

    private static String cleanData(String data) throws UnsupportedEncodingException {
        Tidy tidy = new Tidy();
        tidy.setNumEntities(true); // output numeric entities rather than named ones
        tidy.setPrintBodyOnly(true); // print only the body content
        tidy.setWraplen(Integer.MAX_VALUE); // effectively disable line wrapping
        ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        tidy.parseDOM(inputStream, outputStream);
        return outputStream.toString("UTF-8");
    }
}
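
To make the idea of numeric entities concrete, here is a minimal plain-Java sketch of what such a conversion looks like. It is not JTidy's implementation: the class NumericEntitySketch and the method toNumericEntities are hypothetical names made up for illustration. It only rewrites non-ASCII characters and does not escape &, < or >.

```java
final class NumericEntitySketch {

    // Hypothetical helper, not part of JTidy: replaces every non-ASCII
    // code point with a decimal numeric character reference (&#NNN;).
    static String toNumericEntities(String text) {
        StringBuilder out = new StringBuilder(text.length());
        text.codePoints().forEach(cp -> {
            if (cp > 127) {
                out.append("&#").append(cp).append(';');
            } else {
                out.appendCodePoint(cp);
            }
        });
        return out.toString();
    }

    public static void main(String[] args) {
        // prints "ligado &#224; al-Qaeda"
        System.out.println(toNumericEntities("ligado à al-Qaeda"));
    }
}
```

Numeric references like these are handy when the result has to be XML, because XML predefines only a handful of named entities.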
If you like, you can also get a Document instance, since both parseDOM overloads return one:
public org.w3c.dom.Document parseDOM(Reader in, Writer out)
public org.w3c.dom.Document parseDOM(InputStream in, OutputStream out)
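
Once you hold an org.w3c.dom.Document, you can walk it with the usual DOM calls. The sketch below is only an illustration under assumptions: it builds the Document with the JDK's own parser as a stand-in for the value returned by parseDOM, and the names DomWalkSketch, paragraphTexts and parseXhtml are hypothetical. It sticks to basic DOM methods, since I have not checked how much of the newer DOM API JTidy's Document supports.

```java
import java.io.ByteArrayInputStream;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.w3c.dom.NodeList;

final class DomWalkSketch {

    // Hypothetical helper: collects the leading text node of every <p> element.
    static List<String> paragraphTexts(Document doc) {
        List<String> texts = new ArrayList<>();
        NodeList paragraphs = doc.getElementsByTagName("p");
        for (int i = 0; i < paragraphs.getLength(); i++) {
            Node first = paragraphs.item(i).getFirstChild();
            if (first != null && first.getNodeType() == Node.TEXT_NODE) {
                texts.add(first.getNodeValue());
            }
        }
        return texts;
    }

    // Stand-in for tidy.parseDOM(...): parses well-formed XHTML with the JDK parser.
    static Document parseXhtml(String xhtml) {
        try {
            return DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xhtml.getBytes(StandardCharsets.UTF_8)));
        } catch (Exception e) {
            throw new IllegalStateException("could not parse XHTML", e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseXhtml("<html><body><p>first</p><p>second</p></body></html>");
        System.out.println(paragraphTexts(doc)); // prints [first, second]
    }
}
```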