html - XPath: Find an HTML element by plain text

Question

Please note: this question is a more refined version of a previous question.

I am looking for an XPath that lets me find elements with a given plain text in an HTML document. For example, suppose I have the following HTML:

<html>
<head>...</head>
<body>
    <someElement>This can be found</someElement>
    <nested>
        <someOtherElement>This can <em>not</em> be found most nested</someOtherElement>
    </nested>
    <yetAnotherElement>This can <em>not</em> be found</yetAnotherElement>
</body>
</html>

I need to search by text, and I am able to find <someElement> using the following XPath:

//*[contains(text(), 'This can be found')]

I am looking for a similar XPath that lets me find <someOtherElement> and <yetAnotherElement> using the plain text "This can not be found". The following does not work:

//*[contains(text(), 'This can not be found')]

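I know this is because the nested em element "disrupts" the text flow of "This can not be found". Is it possible in XPath to somehow ignore this and similar nestings?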
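The disruption can be made visible outside XPath. In XPath 1.0, contains(text(), ...) converts the text() node-set to a string by taking only its first node, so only the text before the em element is tested. The following is a minimal sketch of that effect; it assumes Python's standard xml.etree.ElementTree module as a stand-in for an XPath engine, which is not something the question itself uses.

```python
import xml.etree.ElementTree as ET

# The <yetAnotherElement> from the question, parsed on its own.
el = ET.fromstring(
    "<yetAnotherElement>This can <em>not</em> be found</yetAnotherElement>"
)

# Text before the first child element: roughly what text() contributes
# when XPath 1.0 converts the node-set to a string.
first_text_node = el.text

# All descendant text concatenated: the element's string value.
string_value = "".join(el.itertext())

print(repr(first_text_node))  # 'This can '
print(repr(string_value))     # 'This can not be found'
```

The search phrase appears only in the concatenated string value, never in the first text node alone.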

score 11 · Accepted Answer

You can use

//*[contains(., 'This can not be found')]
   [not(.//*[contains(., 'This can not be found')])]

This XPath consists of two parts:

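//*[contains(., 'This can not be found')]: The . refers to the context node, and contains() converts it to its string value, which is the concatenation of all text nodes beneath it. This part therefore selects every element whose string value contains "This can not be found". In the example above, these are <someOtherElement> and <yetAnotherElement>, but also their ancestors <nested>, <body> and <html>.
[not(.//*[contains(., 'This can not be found')])]: This removes every element that has a descendant element which itself still contains the plain text "This can not be found". In the example above, it drops the unwanted elements <nested>, <body> and <html>.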
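The two predicates can also be traced by hand. The following is a minimal sketch that emulates them in Python using only the standard library; it is an illustration, not how an XPath engine works internally. It assumes xml.etree.ElementTree, whose own limited XPath subset has no contains() function, so the logic is written out explicitly. The helper names string_value and innermost_matches are hypothetical.

```python
import xml.etree.ElementTree as ET

# The sample document from the question (well-formed, so an XML parser accepts it).
HTML = """<html>
<head>...</head>
<body>
    <someElement>This can be found</someElement>
    <nested>
        <someOtherElement>This can <em>not</em> be found most nested</someOtherElement>
    </nested>
    <yetAnotherElement>This can <em>not</em> be found</yetAnotherElement>
</body>
</html>"""


def string_value(element):
    """Concatenate all descendant text, like the XPath string value of '.'."""
    return "".join(element.itertext())


def innermost_matches(root, needle):
    """Tags of elements that contain needle but have no descendant element containing it."""
    result = []
    for element in root.iter():  # root and all descendants, in document order
        # First predicate: contains(., needle)
        if needle not in string_value(element):
            continue
        # Second predicate: not(.//*[contains(., needle)])
        descendants = list(element.iter())[1:]  # iter() yields the element itself first
        if any(needle in string_value(d) for d in descendants):
            continue
        result.append(element.tag)
    return result


matches = innermost_matches(ET.fromstring(HTML), "This can not be found")
print(matches)  # ['someOtherElement', 'yetAnotherElement']
```

With a full XPath 1.0 engine, the expression from the answer can be passed in directly and no such emulation is needed.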

You can try out these XPaths here.
