java - Java string charset parsing

Question

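I am parsing some web pages with the Jsoup API. The pages arrive in one character encoding, and I have to convert them to another.

Question: how can I convert the first string below into the second?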

String str1 = "Um grupo ligado &agrave; al-Qaeda assumiu o "
    + "ataque e amea&ccedil;ou fazer outros.";

String str2 = "Um grupo ligado &#224; al-Qaeda assumiu o "
    + "ataque e amea&#231;ou fazer outros.";

// (The text above translates to some news about the WTC.)

score 0 · Accepted Answer

I haven't really tested Jsoup, but JTidy was very helpful to me when I needed to convert HTML to XML, using the class org.w3c.tidy.Tidy. It converts the entities automatically.

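import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.UnsupportedEncodingException;

import org.w3c.tidy.Tidy;

// The wrapper class name is arbitrary; it only makes the snippet compile.
public class TidyExample {

    static String str1 = "Um grupo ligado &agrave; al-Qaeda assumiu o "
            + "ataque e amea&ccedil;ou fazer outros.";

    public static void main(String[] args) throws Exception {
        System.out.println(cleanData(str1));
    }

    private static String cleanData(String data) throws UnsupportedEncodingException {
        Tidy tidy = new Tidy();
        tidy.setNumEntities(true); // emit numeric entities instead of named ones
        tidy.setPrintBodyOnly(true); // print only the contents of <body>
        tidy.setWraplen(Integer.MAX_VALUE); // effectively disable line wrapping
        ByteArrayInputStream inputStream = new ByteArrayInputStream(data.getBytes("UTF-8"));
        ByteArrayOutputStream outputStream = new ByteArrayOutputStream();
        tidy.parseDOM(inputStream, outputStream); // writes the tidied markup to outputStream
        return outputStream.toString("UTF-8");
    }
}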

If you like, you can also get a Document instance, since both parseDOM overloads return one:

public org.w3c.dom.Document parseDOM(Reader in, Writer out)
public org.w3c.dom.Document parseDOM(InputStream in, OutputStream out)

score 0 · Accepted Answer

I'm not an expert on the subject, but I believe the answer you are looking for is at http://www.davidcraddock.net/tag/beautifulsoup/

score 0 · Accepted Answer

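A bit like the JTidy solution: the named entities, like &agrave;, are defined in the .dtd files at w3c.org, which maintains the HTML <!DOCTYPE ... definitions. Copy them locally and parse them (simple). Then you can either replace the entities with the Unicode string right away, or turn them into numeric entities.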
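
A minimal sketch of that idea, using only the JDK. The class and method names (EntityConverter, parseDeclarations, toNumeric, toUnicode) are made up for illustration, and the two declarations inlined in main are only written in the style of the W3C Latin-1 entity files; in practice you would read the text of your local copy of those files instead. Declarations whose value is not a single plain numeric reference are skipped by this regex, and unknown entities are left untouched.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper; the class and method names are illustrative only.
final class EntityConverter {

    // Matches declarations such as <!ENTITY agrave "&#224;"> and the
    // HTML 4 form <!ENTITY agrave CDATA "&#224;" -- comment -->.
    private static final Pattern DECLARATION =
            Pattern.compile("<!ENTITY\\s+(\\w+)\\s+(?:CDATA\\s+)?\"&#(\\d+);\"");

    // Matches named references such as &agrave; in the text being converted.
    private static final Pattern REFERENCE =
            Pattern.compile("&([A-Za-z][A-Za-z0-9]*);");

    private EntityConverter() { }

    /** Builds an entity-name to code-point table from the text of a DTD entity file. */
    public static Map<String, Integer> parseDeclarations(String dtdText) {
        Map<String, Integer> table = new HashMap<>();
        Matcher m = DECLARATION.matcher(dtdText);
        while (m.find()) {
            table.put(m.group(1), Integer.parseInt(m.group(2)));
        }
        return table;
    }

    /** Rewrites known named entities as numeric ones. */
    public static String toNumeric(String text, Map<String, Integer> table) {
        return replace(text, table, true);
    }

    /** Rewrites known named entities as the characters they stand for. */
    public static String toUnicode(String text, Map<String, Integer> table) {
        return replace(text, table, false);
    }

    private static String replace(String text, Map<String, Integer> table, boolean numeric) {
        Matcher m = REFERENCE.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            Integer codePoint = table.get(m.group(1));
            String replacement;
            if (codePoint == null) {
                replacement = m.group(); // unknown entity: leave it untouched
            } else if (numeric) {
                replacement = "&#" + codePoint + ";";
            } else {
                replacement = new String(Character.toChars(codePoint));
            }
            // quoteReplacement keeps '$' and '\' in the replacement literal.
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    public static void main(String[] args) {
        // Stand-in for the contents of a locally saved entity file.
        String dtd = "<!ENTITY agrave CDATA \"&#224;\" -- a with grave -->\n"
                + "<!ENTITY ccedil \"&#231;\">\n";
        Map<String, Integer> table = parseDeclarations(dtd);

        String str1 = "Um grupo ligado &agrave; al-Qaeda assumiu o "
                + "ataque e amea&ccedil;ou fazer outros.";
        // Prints: Um grupo ligado &#224; al-Qaeda assumiu o ataque e amea&#231;ou fazer outros.
        System.out.println(toNumeric(str1, table));
    }
}
```

Calling toUnicode with the same table gives the text with the real characters in place of the entities, which is the other option mentioned above.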
