python - XPath: Find HTML element by plain text

Question

Please note: A more refined version of this question, with an appropriate answer can be found here.

I would like to use the Selenium Python bindings to find elements with a given text on a web page. For example, suppose I have the following HTML:

<html>
    <head>...</head>
    <body>
        <someElement>This can be found</someElement>
        <someOtherElement>This can <em>not</em> be found</someOtherElement>
    </body>
</html>

I need to search by text and am able to find <someElement> using the following XPath:

//*[contains(text(), 'This can be found')]

I am looking for a similar XPath that lets me find <someOtherElement> using the plain text "This can not be found". The following does not work:

//*[contains(text(), 'This can not be found')]

I understand that this is because of the nested em element that "disrupts" the text flow of "This can not be found". Is it possible via XPaths to, in a way, ignore such or similar nestings as the one above?

score 18 · Accepted Answer

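You can use //*[contains(., 'This can not be found')].

Here . is the context node, which is converted to its string representation (the concatenated text of the element and all of its descendants) before being compared to "This can not be found".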
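
To make the difference between text() and . concrete, here is a minimal sketch that uses only Python's standard library rather than a real XPath engine. The helper names first_text_node and string_value are made up for illustration, and they only approximate what XPath 1.0 compares against in each case.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
    <head>...</head>
    <body>
        <someElement>This can be found</someElement>
        <someOtherElement>This can <em>not</em> be found</someOtherElement>
    </body>
</html>"""


def first_text_node(el):
    # Rough stand-in for contains(text(), ...): in XPath 1.0 only the first
    # text node of the node-set is converted to a string. el.text is the
    # text before the first child element, which is the same thing here.
    return el.text or ""


def string_value(el):
    # Rough stand-in for ".": the text of the element and of all its
    # descendants, concatenated in document order.
    return "".join(el.itertext())


root = ET.fromstring(DOC)
target = root.find("./body/someOtherElement")
needle = "This can not be found"

print(repr(first_text_node(target)))  # 'This can '
print(repr(string_value(target)))     # 'This can not be found'
print(needle in first_text_node(target), needle in string_value(target))
```

With Selenium the XPath string itself would be passed unchanged, for example driver.find_elements(By.XPATH, "//*[contains(., 'This can not be found')]"), where driver is an assumed, already created WebDriver instance.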

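Be careful though: because you are using //*, it will match ALL enclosing elements that contain this string.

In your example, it would match:

<someOtherElement>
and <body>
and <html>!

You can restrict this by targeting a specific element tag or a specific section of the document (a <table> or <div> with a known id or class).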
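
The over-matching described above can be reproduced without a browser. This is a sketch under the same assumption as before: the standard library's ElementTree stands in for the XPath engine, and matching_tags is a hypothetical helper, not part of Selenium or of any XPath library.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
    <head>...</head>
    <body>
        <someElement>This can be found</someElement>
        <someOtherElement>This can <em>not</em> be found</someOtherElement>
    </body>
</html>"""


def matching_tags(root, needle, tag=None):
    # root.iter(tag) walks root and all descendants in document order.
    # tag=None visits every element (like //*); a tag name restricts the
    # walk to elements with that name (like //someOtherElement).
    return [
        el.tag
        for el in root.iter(tag)
        if needle in "".join(el.itertext())
    ]


root = ET.fromstring(DOC)
needle = "This can not be found"

print(matching_tags(root, needle))
# ['html', 'body', 'someOtherElement']
print(matching_tags(root, needle, tag="someOtherElement"))
# ['someOtherElement']
```

The equivalent restriction in XPath would be to replace //* with a tag name, as in //someOtherElement[contains(., 'This can not be found')].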

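Edit, following the OP's comment asking how to find the most nested element that matches the text condition:

The accepted answer here suggests //*[count(ancestor::*) = max(//*/count(ancestor::*))] to select the most nested element. I think it is XPath 2.0 only.

Combined with your substring condition, I was able to test it here with this document: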

<html>
<head>...</head>
<body>
    <someElement>This can be found</someElement>
    <nested>
        <someOtherElement>This can <em>not</em> be found most nested</someOtherElement>
    </nested>
    <someOtherElement>This can <em>not</em> be found</someOtherElement>
</body>
</html>

and this XPath 2.0 expression:

//*[contains(., 'This can not be found')]
   [count(ancestor::*) = max(//*/count(./*[contains(., 'This can not be found')]/ancestor::*))]

It matches the element containing "This can not be found most nested".

There is probably a more elegant way to do this.
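
XPath 1.0 has no max() function, so where only an XPath 1.0 engine is available, one option is to do the "most nested" filtering in Python instead. The sketch below is an assumption-laden illustration of that idea using the standard library only. deepest_matches is a hypothetical helper, and it emulates the intent of the expression above (keep the matching elements with the most ancestors) rather than evaluating the XPath itself.

```python
import xml.etree.ElementTree as ET

DOC = """<html>
<head>...</head>
<body>
    <someElement>This can be found</someElement>
    <nested>
        <someOtherElement>This can <em>not</em> be found most nested</someOtherElement>
    </nested>
    <someOtherElement>This can <em>not</em> be found</someOtherElement>
</body>
</html>"""


def deepest_matches(root, needle):
    """Return the elements whose full text contains needle and that
    have the largest number of ancestors among all such elements."""
    matches = []  # (depth, element) pairs, depth = number of ancestors

    def walk(el, depth):
        if needle in "".join(el.itertext()):
            matches.append((depth, el))
        for child in el:
            walk(child, depth + 1)

    walk(root, 0)
    if not matches:
        return []
    max_depth = max(depth for depth, _ in matches)
    return [el for depth, el in matches if depth == max_depth]


root = ET.fromstring(DOC)
found = deepest_matches(root, "This can not be found")
for el in found:
    print(el.tag, repr("".join(el.itertext())))
# someOtherElement 'This can not be found most nested'
```

Note that "most nested" here means maximum depth in the whole document, so the second <someOtherElement>, which sits one level higher, is not returned even though none of its descendants match.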
