htmlunit - HtmlUnit - HTMLParser (page with characters)

Question

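I have a resource (a static HTML page) that I want to use for testing. However, when I load the static page, some characters come out with the wrong encoding. I tried the StringEscapeUtils class, but it did not help. My function: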

  private HtmlPage getStaticPage() throws IOException, ClassNotFoundException {
    final Reader reader = new InputStreamReader(this.getClass().getResourceAsStream("/" + "testPage" + ".html"), "UTF-8");
    final StringWebResponse response = new StringWebResponse(StringEscapeUtils.unescapeHtml4(IOUtils.toString(reader)), StandardCharsets.UTF_8, new URL(URL_PAGE));
    return HTMLParser.parseHtml(response, WebClientFactory.getInstance().getCurrentWindow());
  }

import org.apache.commons.lang3.StringEscapeUtils;

score 0 · Accepted Answer

final Reader reader = new InputStreamReader(this.getClass().getResourceAsStream("/" + "testPage" + ".html"), "UTF-8");

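For the reader, use the encoding of the file itself (from your comment I guess this is windows-1252 in your case) rather than "UTF-8". Then read the file into a String (e.g. using commons.io).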
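
To illustrate why the charset passed to the reader matters, here is a minimal sketch using only the JDK (Java 9+ for readAllBytes). The class name ResourceDecoding and the helper readAll are hypothetical names made up for this example, and the byte array stands in for the contents of testPage.html, assuming the file really is windows-1252 encoded.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

public class ResourceDecoding {

    // Hypothetical helper: reads the whole stream and decodes it with the given charset.
    static String readAll(InputStream in, Charset charset) throws IOException {
        return new String(in.readAllBytes(), charset);
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the file contents: 0xE9 is "é" in windows-1252,
        // but a lone 0xE9 is not a valid UTF-8 sequence.
        byte[] bytes = {'c', 'a', 'f', (byte) 0xE9};
        Charset cp1252 = Charset.forName("windows-1252");

        String correct = readAll(new ByteArrayInputStream(bytes), cp1252);
        String garbled = readAll(new ByteArrayInputStream(bytes), StandardCharsets.UTF_8);

        // Decoding with the file's real charset recovers the text;
        // decoding as UTF-8 substitutes a replacement character instead.
        System.out.println(correct.equals("caf\u00e9"));
        System.out.println(garbled.equals("caf\u00e9"));
    }
}
```

With a real resource, the stream from getResourceAsStream would take the place of the ByteArrayInputStream. Unescaping HTML entities afterwards cannot repair characters that were already decoded with the wrong charset.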

Then you can process it like this:

// anHtmlCode is the String read from the file; aBrowserVersion is the BrowserVersion to emulate
final StringWebResponse tmpResponse = new StringWebResponse(anHtmlCode,
    new URL("http://www.wetator.org/test.html"));
final WebClient tmpWebClient = new WebClient(aBrowserVersion);
try {
  final HtmlPage tmpPage = HTMLParser.parseHtml(tmpResponse, tmpWebClient.getCurrentWindow());
  return tmpPage;
} finally {
  // always release the client, even if parsing fails
  tmpWebClient.close();
}

If you still have problems, reduce your page to a simple sample that shows the problem and upload it here together with your code.
