html - XPath to target a site's text content

Question

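I am trying to write an XPath that targets only the text content of the page, but the "About the author" section beneath the article keeps getting included. I would like the XPath to target only the article text plus the title.

My XPath so far: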

//*[@class="content"]//p[not(contains(@id, "author-bio"))] |
//*[@id="content_wrapper"]//h1

This works, but it does not remove the "About the author" section as expected. I am working with the article below.

http://www.intomobile.com/2013/11/05/samsung-galaxy-s3-android-43-update-rolling-out-international-users/

I am using the FirePath extension for Firefox/Firebug, which lets me see which elements I am targeting.

score 1 · Accepted Answer

That particular document is XHTML; it has a root element of

<html xmlns="http://www.w3.org/1999/xhtml" dir="ltr" lang="en-US"
xmlns:og="http://opengraphprotocol.org/schema/"
xmlns:fb="http://www.facebook.com/2008/fbml">

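That xmlns="..." means that the html element (and all its unprefixed descendants) is in the http://www.w3.org/1999/xhtml namespace. Unprefixed names in an XPath expression refer to nodes that are not in any namespace, so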

//p[not(contains(@id, "author-bio"))]

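is looking for elements named p in no namespace, and will not match elements named p in the http://www.w3.org/1999/xhtml namespace.

The correct way to do this is to map a prefix to that namespace URI and use the prefix in the XPath expression, e.g.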

//xhtml:p[not(contains(@id, "author-bio"))]

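but exactly how you define the prefix mapping depends on which XPath engine you are using. If your tool does not provide a way to set up prefix mappings, you will have to use a predicate on local-name() instead, e.g.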

//*[local-name() = 'p'][not(contains(@id, "author-bio"))]

The same applies to the h1: you need to either bind and use a prefix, or use the *[local-name() = 'h1'] trick.
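The namespace behaviour described above can be reproduced with Python's standard-library ElementTree. This is a minimal sketch, not the setup from the question. The sample markup and the xhtml prefix name are assumptions made for illustration; only the namespace URI comes from the real document. ElementTree supports just a subset of XPath and has no local-name() function, so its {*} namespace wildcard (Python 3.8+) stands in for the local-name() trick.

```python
import xml.etree.ElementTree as ET

# Hypothetical miniature page; only the default xmlns matches the real document.
SAMPLE = """<html xmlns="http://www.w3.org/1999/xhtml">
  <body>
    <h1>Title</h1>
    <p>First paragraph.</p>
    <p>Second paragraph.</p>
  </body>
</html>"""

root = ET.fromstring(SAMPLE)

# The prefix name "xhtml" is arbitrary; only the URI has to match the document.
NS = {"xhtml": "http://www.w3.org/1999/xhtml"}

# Unprefixed name: looks for p in no namespace, so nothing matches.
no_namespace = root.findall(".//p")

# Prefixed name with a prefix-to-URI mapping: matches the XHTML p elements.
with_prefix = root.findall(".//xhtml:p", NS)

# Namespace wildcard: matches p in any namespace (ElementTree's stand-in
# for a local-name() predicate; requires Python 3.8 or later).
any_namespace = root.findall(".//{*}p")

print(len(no_namespace), len(with_prefix), len(any_namespace))
```

Other engines expose the prefix mapping differently, so treat the namespaces argument here as specific to ElementTree.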

score 0 · Accepted Answer

id('home_right_column')//p[not(ancestor::*[@id='author-bio'])] | //*[@id="content_wrapper"]//h1

Figured it out myself :)
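The idea in that expression is to drop every p that has an ancestor with id author-bio, rather than testing the id of the p itself. As a rough illustration, the same filtering can be emulated with Python's standard-library ElementTree, which has no ancestor axis: collect everything under the author-bio element and skip it. The sample markup below is hypothetical and namespace-free, and this is a different technique from running the XPath directly in an engine that supports the full language.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment; the ids mirror those used in the XPath above.
SAMPLE = """<div id="home_right_column">
  <p>Article text.</p>
  <div id="author-bio">
    <p>About the author.</p>
  </div>
</div>"""

root = ET.fromstring(SAMPLE)

# Locate the author-bio container, if present.
bio = root.find(".//*[@id='author-bio']")

# Every element inside the bio (including the container itself).
inside_bio = set(bio.iter()) if bio is not None else set()

# Keep only the paragraphs that do not sit under the author-bio element.
article_paragraphs = [p for p in root.iter("p") if p not in inside_bio]
texts = [p.text for p in article_paragraphs]

print(texts)
```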
