java - Tidy breaks links with non-Latin characters

Question

I use the Java library Tidy to clean up HTML. Some of the HTML contains links with Russian letters, for example

<a href="http://example.com/Русский">link with Russian letters</a>

I know that "Русский" has to be escaped, but this HTML comes from users. My job is to convert it to XHTML.

I think Tidy tries to escape the non-Latin letters, but the result I get is

<a href="http://example.com/%420%443%441%441%43A%438%439">link with Russian letters</a>

This is not correct. The correct version is

<a href="http://example.com/%D0%A0%D1%83%D1%81%D1%81%D0%BA%D0%B8%D0%B9">link with Russian letters</a>

The Java code is

private static Tidy getTidy() {
    if (null == tidy) {
      tidy = new Tidy();
      tidy.setQuiet(true);
      tidy.setShowErrors(0);
      tidy.setShowWarnings(false);
      tidy.setXHTML(true);
      tidy.setOutputEncoding("UTF-8");
    }
    return tidy;
}

public static String sanitizeHtml(String html, URI pageUri) {
    boolean escapeMedia = false;
    String ret = "";
    try {
      Document doc = getTidy().parseDOM(new StringReader("<body>" + html + "</body>"), null);

      // here I make some processing

      // string output
      ByteArrayOutputStream out = new ByteArrayOutputStream();
      Node node = doc.getElementsByTagName("body").item(0);
      getTidy().pprint(node, out);
      // decode with the same charset Tidy wrote, not the platform default
      ret = out.toString("UTF-8").trim();
    }
    catch (Exception e) {
      ret = html;
      e.printStackTrace();
    }

    return ret;
}

score 1 · Accepted Answer

This behavior is hardcoded, and it is probably a bug. They escape non-ASCII characters in URLs using UTF-16 values when they should be using UTF-8 bytes. See org/w3c/tidy/AttrCheckImpl.java.
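One possible workaround is to percent-encode the non-ASCII characters yourself, as UTF-8, before the link reaches Tidy. The sketch below is only an illustration: `HrefEscaper`, `escapeNonAscii`, and `escapeAsCodeUnits` are hypothetical names, not part of Tidy. The second method mimics the output shown in the question by writing each UTF-16 code unit in hex, which is an assumption about what Tidy does based on that output rather than a copy of its code.

```java
import java.nio.charset.StandardCharsets;
import java.util.Locale;

// Hypothetical helper, not part of the Tidy library.
final class HrefEscaper {

    private HrefEscaper() {
    }

    // Percent-encodes every non-ASCII character as its UTF-8 bytes.
    // ASCII characters, including existing %XX escapes, are left untouched.
    static String escapeNonAscii(String url) {
        StringBuilder sb = new StringBuilder(url.length());
        int i = 0;
        while (i < url.length()) {
            int cp = url.codePointAt(i);
            int len = Character.charCount(cp); // 2 for surrogate pairs
            if (cp < 0x80) {
                sb.append((char) cp);
            } else {
                byte[] bytes = url.substring(i, i + len)
                        .getBytes(StandardCharsets.UTF_8);
                for (byte b : bytes) {
                    sb.append('%').append(String.format("%02X", b & 0xFF));
                }
            }
            i += len;
        }
        return sb.toString();
    }

    // Assumed reconstruction of the broken output from the question:
    // each non-ASCII UTF-16 code unit is written as a single hex number.
    static String escapeAsCodeUnits(String url) {
        StringBuilder sb = new StringBuilder(url.length());
        for (int i = 0; i < url.length(); i++) {
            char c = url.charAt(i);
            if (c < 0x80) {
                sb.append(c);
            } else {
                sb.append('%')
                  .append(Integer.toHexString(c).toUpperCase(Locale.ROOT));
            }
        }
        return sb.toString();
    }
}
```

Calling `HrefEscaper.escapeNonAscii` on each `href` value before handing the HTML to `parseDOM` should leave only ASCII in the URL, so Tidy's own escaping has nothing left to rewrite. I have not checked this against every Tidy version, and extracting the `href` values from the raw user HTML is left out of the sketch.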
