python - lxml.html: extracting a string by searching for a keyword

Question

I have a piece of HTML that looks like this:

<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>

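I want to get the string "The Keyword: The text".

I know I can use Chrome's Inspect or Firefox's Firebug to get the XPath for the HTML above, then call select(xpath).extract() and strip the HTML tags to get the string. However, that approach is not general enough, because the XPath is not consistent across pages.

So I am thinking of the following approach: first, search for "The Keyword:" (the code uses scrapy's HtmlXPathSelector, since I am not sure how to do the same in lxml.html):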

hxs = HtmlXPathSelector(response)
hxs.select('//*[contains(text(), "The Keyword:")]')

When I pprint the result, I get something back:

>>> pprint( hxs.select('//*[contains(text(), "The Keyword:")]') )
<HtmlXPathSelector xpath='//*[contains(text(), "The Keyword:")]' data=u'<label>The Keyword:</label>'>

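My question is how to get the desired string, "The Keyword: The text". I am trying to work out how to determine the XPath; if the XPath were known, I could of course get the string.

I am also open to solutions that do not use lxml.html.

Thanks.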

score 2 · Accepted Answer

from lxml import html

s = '<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>'

tree = html.fromstring(s)
# text_content() concatenates every text node under the element, with tags removed.
text = tree.text_content()
print(text)
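The snippet above works because the whole input is the one list item. On a full page, the label would probably have to be located first. A minimal sketch of that, assuming the value always sits in the elements that follow the label inside the same parent; the `PAGE` sample and the `keyword_line` helper are hypothetical names made up for illustration:

```python
from lxml import html

# Hypothetical sample page with more than one list item.
PAGE = (
    '<ul>'
    '<li><label>Other:</label><span>Something else</span></li>'
    '<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>'
    '</ul>'
)

def keyword_line(doc, keyword):
    """Return 'keyword value' for the first element whose own text contains keyword."""
    # $kw is an XPath variable; lxml fills it in from the kw= argument.
    hits = doc.xpath('//*[text()[contains(., $kw)]]', kw=keyword)
    if not hits:
        return None
    label = hits[0]
    # Join the text of every sibling that follows the label in the same parent.
    value = ''.join(sib.text_content() for sib in label.itersiblings())
    return '{} {}'.format(label.text_content().strip(), value.strip())

result = keyword_line(html.fromstring(PAGE), 'The Keyword:')
print(result)
```

Passing the keyword as an XPath variable avoids quoting problems when the keyword itself contains quote characters.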

score 1 · Accepted Answer

You can modify the XPath slightly to work with your current structure: get the parent of the label, then look back down from there for the a element, and take the text from that:

>>> tree.xpath('//*[contains(text(), "The Keyword:")]/..//a/text()')
['The text']

But that may not be flexible enough...
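One possible way to loosen it, sketched under the assumption that the value is always in the element immediately after the label: select that sibling with the following-sibling axis and let XPath's string() flatten whatever is nested inside it, so the expression no longer depends on an a element being present.

```python
from lxml import html

s = '<li><label>The Keyword:</label><span><a href="../../..">The text</a></span></li>'
tree = html.fromstring(s)

# following-sibling::*[1] is the element right after the matching label;
# string(...) returns its text with any nested tags stripped.
value = tree.xpath(
    'string(//*[contains(text(), "The Keyword:")]/following-sibling::*[1])'
)
print(value)
```

If nothing matches, string() yields an empty string rather than raising, so the caller would need to check for that case.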
