python - Finding a tag with lxml in Python

Question

I am using lxml to parse an HTML document that contains a Facebook comments tag like this:

<fb:comments id="fb_comments"  href="http://example.com" num_posts="5" width="600"></fb:comments>

I am trying to select it to get the href value, but when I call cssselect('fb:comments') I get the following error:

The pseudo-class Symbol(u'comments', 3) is unknown

Is there a way to do this?

Edit: the code:

from lxml.html import fromstring
html = '...'
parser = fromstring(html)
parser.cssselect('fb:comments')  #raises the exception

score 3 · Accepted Answer

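The cssselect() method queries the document using the given CSS selector expression. In your case, the colon character (:) is the XML namespace prefix separator (i.e. <namespace:tagname/>), and it is being confused with the CSS pseudo-class syntax (i.e. tagname:pseudo-class).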

According to the lxml manual, you should use the namespace-prefix|element syntax in cssselect() to find a tag (comments) that has a namespace prefix (fb). So:

from lxml.html import fromstring
html = '...'
parser = fromstring(html)
parser.cssselect('fb|comments')
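If the prefixed selector does not match in your setup (for example because the fb prefix is never bound to a namespace in the HTML), XPath offers two fallbacks that do not depend on namespace handling. This is only a sketch: the wrapping <div> and the variable names are assumptions for illustration, and whether the parser keeps the literal tag name fb:comments may depend on the lxml/libxml2 version.

```python
from lxml.html import fromstring

# Markup from the question, wrapped in a <div> for illustration.
html = (
    '<div><fb:comments id="fb_comments" href="http://example.com" '
    'num_posts="5" width="600"></fb:comments></div>'
)
doc = fromstring(html)

# Option 1: match on the id attribute, which sidesteps the tag name entirely.
by_id = doc.xpath('//*[@id="fb_comments"]')
href = by_id[0].get('href') if by_id else None

# Option 2 (assumption: the HTML parser keeps the literal name "fb:comments"):
# compare the element name as a string instead of using a namespace prefix.
by_name = doc.xpath('//*[name()="fb:comments"]')

print(href)
print([el.get('href') for el in by_name])
```

Matching on the id is the more robust of the two, since it works regardless of how the parser stores the tag name.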
