java - Preserving non-HTML elements when parsing with jsoup

Question

I'm new to jsoup and am having some trouble with non-HTML elements (scripts). I have the following HTML:

<$if not dcSnippet$>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<meta http-equiv="generator" content="Outside In HTML Converter version 8.4.0"/>
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8" />
<title></title>
</head>

<$endif$>
<div style="position:relative">
<p style="text-align: left; font-family: times; font-size: 10pt; font-weight: normal; font-style: normal; text-decoration: none"><span style="font-weight: normal; font-style: normal">This is a test document.</span></p>
</div>
<$if not dcSnippet$>
</body>
</html>
<$endif$>

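The application that displays this content knows how to handle statements such as <$if not dcSnippet$>. When I simply parse the text with jsoup, however, the < and > are escaped and the HTML is reorganized, so it no longer executes or displays correctly. Like this: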

<html xmlns="http://www.w3.org/1999/xhtml"><head></head><body>&lt;$if not dcSnippet$&gt;
<meta http-equiv="generator" content="Outside In HTML Converter version 8.4.0">
<meta http-equiv="Content-Type" content="text/html; charset=UTF-8">
<title></title>
&lt;$endif$&gt;
<div style="position:relative">
<p style="text-align: left; font-family: times; font-size: 10pt; font-weight: normal; font-style: normal; text-decoration: none"><span style="font-weight: normal; font-style: normal">This is a test document.</span></p>
</div>
&lt;$if not dcSnippet$&gt;
&lt;$endif$&gt;
</body></html>

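My end goal is to add a few CSS and JS includes and modify a few element attributes. That part isn't a problem; I've worked most of it out. The problem is that I don't know how to preserve the non-HTML elements and keep them in the same position as in the original. My solution so far goes like this: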

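Read in the HTML file and iterate over it, removing the lines that contain non-HTML elements.
Create a Document object from the pure HTML.
Make my modifications.
Go back through the HTML and reinsert the non-HTML elements (scripts) that I removed at the start.
Save the document to the file system.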

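This currently works as long as the position of the non-HTML is predictable, and so far it has been. But I'd like to know whether there is a better way to do this, so that I don't have to "clean" the HTML first and then manually reintroduce what I removed. Here is the gist of my code (hopefully I haven't left out too many declarations):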

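// Working variables for the line-by-line pass
String thisLine;
String ifStatement = "";
String endifStatement = "";
String tempHtml = "";

FileReader fr = new FileReader(inputFile);
BufferedReader br = new BufferedReader(fr);
while ((thisLine = br.readLine()) != null) {
    if (thisLine.matches(".*<\\$if.*\\$>")) {
        // Remember the <$if ...$> line and keep it out of the HTML
        ifStatement = thisLine + "\n";
    } else if (thisLine.matches(".*<\\$endif\\$>")) {
        // Remember the <$endif$> line and keep it out of the HTML
        endifStatement = thisLine + "\n";
    } else {
        // Everything else is treated as plain HTML
        tempHtml += thisLine + "\n";
    }
}
br.close();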

// For String input, the second argument of Jsoup.parse is a base URI, not a charset
Document doc = Jsoup.parse(tempHtml);
doc.outputSettings().prettyPrint(false).escapeMode(EscapeMode.extended);

Element head = doc.head();
Element body = doc.body();
Element firstDiv = body.select("div").first();

[... perform my element and attribute inserts ...]

body.prependText("\n" + endifStatement);
body.appendText("\n" + ifStatement);
String fullHtml = (ifStatement + doc.toString().replaceAll("\\&lt;", "<").replaceAll("\\&gt;", ">") + "\n" + endifStatement);

BufferedWriter htmlWriter = new BufferedWriter(new OutputStreamWriter(new FileOutputStream(outputFile), "UTF-8"));
htmlWriter.write(fullHtml);
htmlWriter.flush();
htmlWriter.close();

Many thanks for any help or input!

score 0 · Accepted Answer

"The problem is that I don't know how to preserve the non-HTML elements and keep them in the same position as in the original."

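Jsoup is an HTML parser. The "HTML file" you provided does not contain HTML. It is more of a template file written in an HTML-like language.

Jsoup will therefore treat this template file as, at best, an invalid HTML file. That is why all the non-HTML elements get escaped.

To achieve what you need, you would have to write a custom template parser. Jsoup does provide some generic classes that would make this task quite easy.

However, by design, those generic classes are for internal use only.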

This leaves us with four options:

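Your current solution
    Feed Jsoup pure HTML
Open an issue with the Jsoup team
    Ask for the ability to create custom parsers
Write a more robust custom parser
    This is reinventing the wheel, IMO
Change (if feasible) your current template language
    Look at Mustache or Thymeleaf, for example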
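
As a rough illustration of the first option, one way to feed Jsoup pure HTML without tracking directive lines by hand might be to swap every <$...$> directive for an HTML comment placeholder before parsing and swap the originals back after serializing. The sketch below uses only the Java standard library; the class name DirectiveMasker, the methods mask and unmask, and the "tpl:N" placeholder format are all made up for this example and are not part of Jsoup. It assumes the parser keeps comments in the output, and it does not guarantee that a comment wrapped around structural tags such as </body> stays exactly where it was, since the parser may still reorganize the tree.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: hides <$...$> directives behind HTML comments.
final class DirectiveMasker {
    // Matches template directives such as <$if not dcSnippet$> or <$endif$>.
    private static final Pattern DIRECTIVE = Pattern.compile("<\\$.*?\\$>");
    // Matches the placeholders produced by mask(), e.g. <!--tpl:0-->.
    private static final Pattern PLACEHOLDER = Pattern.compile("<!--tpl:(\\d+)-->");

    // Directives in the order they were found; the index is the placeholder id.
    private final List<String> directives = new ArrayList<>();

    // Replaces each directive with a numbered comment and remembers the original.
    String mask(String template) {
        Matcher m = DIRECTIVE.matcher(template);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            directives.add(m.group());
            m.appendReplacement(out, "<!--tpl:" + (directives.size() - 1) + "-->");
        }
        m.appendTail(out);
        return out.toString();
    }

    // Puts the remembered directives back in place of their placeholders.
    String unmask(String html) {
        Matcher m = PLACEHOLDER.matcher(html);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            int index = Integer.parseInt(m.group(1));
            // quoteReplacement keeps the '$' characters in the directive literal.
            m.appendReplacement(out, Matcher.quoteReplacement(directives.get(index)));
        }
        m.appendTail(out);
        return out.toString();
    }
}
```

In use, the string returned by mask(...) would be what gets passed to Jsoup.parse, and the serialized document would go through unmask(...) on the same DirectiveMasker instance before being written to disk. This would also make the replaceAll calls on &lt; and &gt; unnecessary, because the directives never reach the parser as text.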
