xpath - Extracting text between nodes via XPath

Question

I am trying to read a specific part of a web page via XPath. The page is not very well formatted, but I cannot change that...

<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>

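I want to extract the text of the individual items, i.e. the text between the header divs (for example, "Here is the text of the first item."). So far I have been using this XPath expression: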

//text()[preceding::*[@class='header' and contains(text(),'First item')] and following::*[@class='header' and contains(text(),'Second item')]]

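However, I cannot hard-code the name of the ending item, because on the pages I want to scrape the items appear in a different order (for example, "First item" might come after "Third item").

Any help on how to adapt my XPath query would be much appreciated.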

score 2 · Accepted Answer

Found it!

//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[@class='header'][1][contains(text(),'First item')]]]

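Indeed, Aleh, your solution does not work when the text contains tags.

Now the remaining case is the last item, which is not followed by an element with class=header; the query will therefore include all text found up to the end of the document. Any ideas?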
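
One possible way around the last-item case, if the grouping can be done outside XPath, is to walk the children of the first textfield div in document order and start a new bucket at every header. The sketch below swaps XPath for Python's standard-library ElementTree; it assumes the page parses as well-formed XML and that every header is a direct child of the textfield div, as in the sample from the question. The function name `texts_by_header` is made up for illustration.

```python
import xml.etree.ElementTree as ET

PAGE = """<root>
    <div class="textfield">
        <div class="header">First item</div>
        Here is the text of the <strong>first</strong> item.
        <div class="header">Second item</div>
        <span>Here is the text of the second item.</span>
        <div class="header">Third item</div>
        Here is the text of the third item.
    </div>
    <div class="textfield">
        Footer text
    </div>
</root>"""


def texts_by_header(container):
    """Map each header's text to the text that follows it inside `container`."""
    sections = {}
    current = None  # name of the header we are currently under
    for child in container:
        if child.get("class") == "header":
            # A header closes the previous item and opens a new one.
            current = "".join(child.itertext()).strip()
            sections[current] = []
        elif current is not None:
            # Inline element such as <strong> or <span>: keep all of its text.
            sections[current].append("".join(child.itertext()))
        # Text that sits directly after this child, before the next sibling.
        if current is not None and child.tail:
            sections[current].append(child.tail)
    # Collapse the whitespace left over from the page's indentation.
    return {name: " ".join("".join(parts).split())
            for name, parts in sections.items()}


root = ET.fromstring(PAGE)
first_textfield = root.find("div[@class='textfield']")  # first match only
items = texts_by_header(first_textfield)
print(items["Third item"])
```

Because the walk stops at the end of the container, the last item ends where the first textfield div ends, so the footer text is not picked up, and the order of the items on the page does not matter.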

score 2 · Accepted Answer

//*[@class='header' and contains(text(),'First item')]/following::text()[1] will select the first text node after <div class="header">First item</div>.
//*[@class='header' and contains(text(),'Second item')]/following::text()[1] will select the first text node after <div class="header">Second item</div>, and so on.
Edit: Sorry, this does not work for the <strong> case. I will update my answer.
Edit 2: Using part of @Michiel's answer. It looks awful, but it works: //div[@class='textfield'][1]//text()[preceding::*[@class='header' and contains(text(),'First item')]][following::*[preceding::*[not(self::strong) and not(self::span)][1][contains(text(),'First item')]] or not(//*[preceding::*[@class='header' and contains(text(),'First item')]])]
It seems there ought to be a better solution for this :)

score 1 · Accepted Answer

For the sake of completeness, here is the final query, put together from the various suggestions in this thread:

//*[
    @class='textfield' and position() = 1
]
//text() [
    preceding::*[
        @class='header' and contains(text(),'First item')
    ]
][
    following::*[
        preceding::*[
            @class='header'
        ][1][
            contains(text(),'First item')
        ]
    ]
]
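
Since the item name should not be hard-coded, the query above can be generated for whichever item is wanted. A rough sketch, assuming the query is built as a string in Python and then handed to an XPath 1.0 engine that supports the preceding and following axes (lxml does, as far as I know; the standard-library ElementTree does not). The helper names `xpath_literal` and `build_item_query` are made up for illustration.

```python
def xpath_literal(value):
    """Return `value` as an XPath 1.0 string literal."""
    if "'" not in value:
        return "'" + value + "'"
    if '"' not in value:
        return '"' + value + '"'
    # XPath 1.0 has no escape character, so glue the pieces with concat().
    parts = value.split("'")
    return "concat(" + ", \"'\", ".join("'" + p + "'" for p in parts) + ")"


def build_item_query(item_name):
    """Build the thread's final query for the item called `item_name`."""
    lit = xpath_literal(item_name)
    return (
        "//*[@class='textfield' and position() = 1]"
        "//text()"
        # Text that comes after the header of the wanted item ...
        f"[preceding::*[@class='header' and contains(text(), {lit})]]"
        # ... and before an element whose nearest header is still that item.
        f"[following::*[preceding::*[@class='header'][1][contains(text(), {lit})]]]"
    )


print(build_item_query("Third item"))
```

The generated query behaves exactly like the hand-written one, so it still depends on some element following the item's text in the document.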
