java - Extract text outside of HTML tags

Question

I have the following HTML code:

<div class=example>Text #1</div> "Another Text 1"
<div class=example>Text #2</div> "Another Text 2"

I want to extract the text outside the tags: "Another Text 1" and "Another Text 2".

I am using JSoup to do this.

Any ideas?

Thanks!

score 4 · Accepted Answer

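One solution is to use the ownText() method (see the Jsoup documentation). It returns only the text owned directly by the given element and ignores any text owned by its child elements.

Using just the HTML you provided, you can extract the <body> element's own text: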

String html = "<div class='example'>Text #1</div> 'Another Text 1'<div class='example'>Text #2</div> 'Another Text 2'";

Document doc = Jsoup.parse(html);
System.out.println(doc.body().ownText());

This will output:

'Another Text 1' 'Another Text 2'

Note that ownText() can be called on any Element. There is another example in the documentation.
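To make the difference between text() and ownText() concrete, here is a minimal sketch using the markup from the question. The class name OwnTextDemo and the helper textOutsideTags are made up for illustration; only Jsoup.parse, body(), text() and ownText() come from Jsoup.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class OwnTextDemo {
    // Hypothetical helper: returns only the text that sits directly inside <body>,
    // skipping the text inside child elements such as the divs.
    static String textOutsideTags(String html) {
        Document doc = Jsoup.parse(html);
        return doc.body().ownText();
    }

    public static void main(String[] args) {
        String html = "<div class=example>Text #1</div> \"Another Text 1\"\n"
                    + "<div class=example>Text #2</div> \"Another Text 2\"";

        Element body = Jsoup.parse(html).body();
        System.out.println(body.text());            // all text, including the text inside each div
        System.out.println(textOutsideTags(html));  // only the text directly under <body>
    }
}
```

Because ownText() joins all of the element's direct text nodes into one string, it fits best when you do not need each fragment separately.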

score 2 · Accepted Answer

You can select the next Node (not Element!) after each div tag. In your example they are all TextNodes.

final String html = "<div class=example>Text #1</div> \"Another Text 1\"\n"
                  + "<div class=example>Text #2</div> \"Another Text 2\" ";

Document doc = Jsoup.parse(html);

for( Element element : doc.select("div.example") ) // Select all the div tags
{
    TextNode next = (TextNode) element.nextSibling(); // Get the next node of each div as a TextNode

    System.out.println(next.text()); // Print the text of the TextNode
}

Output:

 "Another Text 1" 
 "Another Text 2"

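The cast above assumes every div is followed by a text node. A slightly more defensive sketch, assuming the input may contain a div with no following text, checks the sibling's type first. The class name SiblingTextDemo, the helper textAfterDivs and the third div in the sample input are made up for illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.nodes.Node;
import org.jsoup.nodes.TextNode;

class SiblingTextDemo {
    // Hypothetical helper: collects the text that immediately follows each div.example,
    // skipping divs whose next sibling is missing or is not a TextNode.
    static List<String> textAfterDivs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element div : doc.select("div.example")) {
            Node next = div.nextSibling(); // null when the div is the last child
            if (next instanceof TextNode) {
                String text = ((TextNode) next).text().trim();
                if (!text.isEmpty()) {
                    result.add(text);
                }
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String html = "<div class=example>Text #1</div> \"Another Text 1\"\n"
                    + "<div class=example>Text #2</div> \"Another Text 2\" "
                    + "<div class=example>Text #3</div>"; // no text after this one
        for (String text : textAfterDivs(html)) {
            System.out.println(text);
        }
    }
}
```

Unlike ownText(), this keeps each fragment as a separate list entry, which is convenient when every fragment has to be paired with the div that precedes it.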