Given an HTML string like this:
<p><strong>This is a text Message.</strong></p>
<ul>
<li>UL 1</li>
<li><strong>UL </strong>2</li>
<li><em>UL 3</em></li>
</ul>
<ol>
<li style="font-weight: bold;"><strong>First statement</strong></li>
<li><strong>Second </strong>Statement</li>
<li>Third <strong>Statement</strong></li>
</ol>
<p>This is another <em>text </em>message.</p>
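I tried writing some basic code to extract the correct values, but it does not work as expected.
The code does the following:
- Get all child elements from the HTML.
- Process them in order and extract the text from each one.
The problem is caused by nested tags, such as strong inside p, etc.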
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

public void CreateHtmlToRichText() {
    // Expected output:
    // This is a text Message.
    //
    // UL 1
    // UL 2
    // UL 3
    // First statement
    // Second Statement
    // Third Statement
    //
    // This is another text message.
    String htmlString = "<p><strong>This is a text Message.</strong></p>\n" +
            "<ul>\n" +
            " <li>UL 1</li>\n" +
            " <li><strong>UL </strong>2</li>\n" +
            " <li><em>UL 3</em></li>\n" +
            "</ul>\n" +
            "<ol>\n" +
            " <li style=\"font-weight: bold;\"><strong>First statement</strong></li>\n" +
            " <li><strong>Second </strong>Statement</li>\n" +
            " <li>Third <strong>Statement</strong></li>\n" +
            "</ol>\n" +
            "<p>This is another <em>text </em>message.</p>";
    System.out.println(htmlString);
    Document document = Jsoup.parse(htmlString);
    // Every element under <body>, in document order (parents before their children).
    Elements elements = document.body().children().select("*");
    System.out.println("****************************");
    System.out.println(elements.size());
    // Line number -> text collected for that line.
    Map<Integer, String> paragraphMap = new HashMap<>();
    // Tags that start a new output line.
    Set<String> newLineSet = new HashSet<>();
    newLineSet.add("p");
    newLineSet.add("ol");
    newLineSet.add("ul");
    newLineSet.add("li");
    newLineSet.add("br");
    int lineNumber = 0;
    for (Element element : elements) {
        // ownText() returns only the text directly inside this element, not the text of its children.
        String ownText = element.ownText();
        String tagName = element.tagName();
        System.out.println("added " + lineNumber + ", " + ownText);
        if (newLineSet.contains(tagName)) {
            lineNumber++;
        }
        paragraphMap.put(lineNumber, paragraphMap.getOrDefault(lineNumber, "") + " " + ownText);
        System.out.println("*********************");
        System.out.println(element);
        System.out.println("Tag : " + tagName);
        System.out.println("Own Text : " + ownText);
        System.out.println("*********************");
    }
    System.out.println(paragraphMap);
    for (int line = 0; line <= lineNumber; line++) {
        if (paragraphMap.containsKey(line)) {
            System.out.println(paragraphMap.get(line).strip());
        }
    }
}
Output:
This is a text Message.
UL 1
2 UL
UL 3
First statement
Statement Second
Third Statement
This is another message. text
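
The swapped words seem to come from the traversal order: the loop visits each parent element before its children, and ownText() returns only the parent's direct text, so for <li><strong>UL </strong>2</li> the "2" is stored before "UL ". One possible way around this is to walk the text nodes themselves in document order and only break lines at the block-level tags. The sketch below illustrates that idea on a small hand-built tree so that it runs without any library. The Node class and the el, text, sample and toLines helpers are hypothetical names made up for this illustration and are not part of jsoup. With jsoup, the same walk could presumably be written with NodeTraversor and a NodeVisitor that appends each TextNode. This sketch also does not emit the blank separator lines shown in the expected output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

public class TextLines {

    // Hypothetical stand-in for a parsed DOM node (not a jsoup class).
    // Elements carry a tag name and children; text nodes carry only text.
    static final class Node {
        final String tag;   // null for a text node
        final String text;  // null for an element
        final List<Node> children = new ArrayList<>();

        Node(String tag, String text) {
            this.tag = tag;
            this.text = text;
        }
    }

    static Node text(String s) {
        return new Node(null, s);
    }

    static Node el(String tag, Node... kids) {
        Node n = new Node(tag, null);
        n.children.addAll(List.of(kids));
        return n;
    }

    // The same tags the question treats as line breaks.
    static final Set<String> NEW_LINE_TAGS = Set.of("p", "ol", "ul", "li", "br");

    // Collects text in document order, one output line per block-level element.
    static List<String> toLines(Node root) {
        List<String> lines = new ArrayList<>();
        StringBuilder current = new StringBuilder();
        walk(root, current, lines);
        flush(current, lines);
        return lines;
    }

    private static void walk(Node node, StringBuilder current, List<String> lines) {
        if (node.tag == null) {
            // Text node: append in the order it appears in the document.
            current.append(node.text);
            return;
        }
        boolean block = NEW_LINE_TAGS.contains(node.tag);
        if (block) flush(current, lines);   // a block tag starts a new line
        for (Node child : node.children) {
            walk(child, current, lines);
        }
        if (block) flush(current, lines);   // and ends the line when it closes
    }

    // Emits the pending line (whitespace collapsed) if it is not empty.
    private static void flush(StringBuilder current, List<String> lines) {
        String line = current.toString().replaceAll("\\s+", " ").strip();
        if (!line.isEmpty()) {
            lines.add(line);
        }
        current.setLength(0);
    }

    // A fragment of the HTML from the question, built by hand.
    static Node sample() {
        return el("body",
                el("ul",
                        el("li", text("UL 1")),
                        el("li", el("strong", text("UL ")), text("2"))),
                el("p", text("This is another "), el("em", text("text ")), text("message.")));
    }

    public static void main(String[] args) {
        // Prints:
        // UL 1
        // UL 2
        // This is another text message.
        toLines(sample()).forEach(System.out::println);
    }
}
```

Inline tags such as strong and em never trigger a flush here, so their text stays on the line of the enclosing p or li in the order it was written.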