xpath - scrapy HtmlXPathSelector: determine the xpath by searching for a keyword

Question

I have a piece of HTML like this:

<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>

I want to get the string "The Keyword: The text".

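I know I can use Chrome's inspector or Firefox's Firebug to get the xpath for the HTML above, then call hxs.select(xpath).extract() and strip the HTML tags to get the string. However, that approach is not generic enough, because the xpath is not consistent across different pages.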

So I am thinking of the following approach: first, search for "The Keyword:" using

hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')

When I pprint the result, I get something back:

>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>

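My question is how to get the wanted string: "The Keyword: The text". I am trying to work out how to determine the xpath; if the xpath is known, then of course I can get the wanted string.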

I am open to any solution other than scrapy's HtmlXPathSelector (e.g. lxml.html might have more features, but I am very new to it).

Thanks.

score 0 · Accepted Answer

If I understand your question correctly, the "following-sibling" axis is what you are looking for.

 //*[contains(text(), "The Keyword:")]/following-sibling::span/a/text()

XPath axes
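
Since the question mentions lxml.html as an option, here is a minimal sketch of the same idea with that library. It assumes lxml is installed. The function name keyword_with_text and the HTML constant are made up for illustration, and joining the label and the link text with a single space is my own choice, not something the page dictates.

```python
import lxml.html

# Sample markup copied from the question.
HTML = ('<li><label>The Keyword:</label>'
        '<span><a href="../../..">The text</a></span></li>')


def keyword_with_text(html, keyword):
    """Return "<label text> <link text>" for every element whose own text
    contains `keyword` and that is followed by a sibling <span><a>.

    Hypothetical helper for illustration only.
    """
    doc = lxml.html.fromstring(html)
    results = []
    # Step 1: locate the element by its text, wherever it sits in the page.
    # $kw is an XPath variable that lxml fills in from the keyword argument.
    for label in doc.xpath('//*[contains(text(), $kw)]', kw=keyword):
        # Step 2: walk to the sibling <span> and read the link text,
        # relative to the element found in step 1.
        for text in label.xpath('following-sibling::span/a/text()'):
            results.append('{} {}'.format(label.text_content().strip(),
                                          text.strip()))
    return results


if __name__ == '__main__':
    print(keyword_with_text(HTML, 'The Keyword:'))
```

Because the second expression is evaluated relative to the matched label, this does not depend on where the li sits in the page, only on the label being followed by a span that contains the link. In Scrapy, the single combined expression above can be passed to hxs.select(...).extract() in the same way.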
