python - lxml equivalent of BeautifulSoup find()

Question

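I recently switched from BeautifulSoup to lxml because lxml can handle broken HTML, which is what I am dealing with. I would like to know the equivalent of BeautifulSoup's find(), or the usual way of getting the same result. In BS, I can find a tree node by searching like this: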

from bs4 import BeautifulSoup
bs = BeautifulSoup(html)
bs.find('span', {'class': 'some-class-name'})

lxml's find() only searches the current level of the tree. What if I want to search all the nodes in the tree?

Thanks

score 2 · Accepted Answer

You can use cssselect:

import lxml.html
root = lxml.html.fromstring(html)
root.cssselect('span.some-class-name')

or xpath:

root.xpath('.//span[@class="some-class-name"]')

Both methods (cssselect and xpath) return a list of matching elements, like findAll/find_all in BeautifulSoup.
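To make the difference between the search styles concrete, here is a minimal sketch. The sample markup and the `find_first` helper are made up for illustration and are not part of lxml. It shows that a bare tag name in `find()` only looks at direct children while a `.//` prefix searches all descendants, and that `@class="..."` compares the whole attribute string, so a token-based XPath is closer to what BeautifulSoup does with multi-class elements. `cssselect` is left out of the sketch because, as far as I know, recent lxml versions need the separate cssselect package for it.

```python
import lxml.html

# Illustrative markup (an assumption, not from the question).
html = (
    '<div>'
    '<p><span class="other">skip</span></p>'
    '<p><span class="some-class-name extra">first</span></p>'
    '<p><span class="some-class-name">second</span></p>'
    '</div>'
)
root = lxml.html.fromstring(html)

# A bare tag name only matches direct children of root.
direct_child = root.find('span')

# A './/' prefix searches all descendants and returns the first match.
any_span = root.find('.//span')

# @class="..." is an exact string comparison, so the element with
# class="some-class-name extra" is not matched here.
exact = root.xpath('.//span[@class="some-class-name"]')

# Token-based match: pad the class list with spaces and look for the
# class name as a whole word.
token_xpath = (
    './/span[contains(concat(" ", normalize-space(@class), " "), '
    '" some-class-name ")]'
)
matches = root.xpath(token_xpath)


def find_first(node, xpath_expr):
    """Hypothetical helper: first XPath match or None, like bs.find()."""
    results = node.xpath(xpath_expr)
    return results[0] if results else None


first = find_first(root, token_xpath)

print(direct_child)
print(any_span.text)
print([el.text for el in exact])
print([el.text for el in matches])
print(first.text if first is not None else None)
```

Returning `None` when nothing matches mirrors how BeautifulSoup's find() behaves, which keeps calling code that checks for a missing node unchanged.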

score 1 · Accepted Answer

If you don't want to bother learning the lxml API or xpath expressions, here is another option:

Beautiful Soup supports the HTML parser included in Python's standard library, but it also supports a number of third-party Python parsers. One is the lxml parser [...]

And to specify the particular parser to use:

BeautifulSoup(markup, "lxml")
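As a rough sketch of how this fits the original question, the snippet below keeps the BeautifulSoup find() call unchanged and only swaps the parser. It assumes both the bs4 and lxml packages are installed. The deliberately unclosed markup is invented for illustration.

```python
from bs4 import BeautifulSoup

# Illustrative broken markup (an assumption): the last <p> and the
# <div> are never closed.
markup = '<div><p><span class="some-class-name">hello</span></p><p>unclosed'

# "lxml" tells Beautiful Soup to parse with lxml's HTML parser.
soup = BeautifulSoup(markup, "lxml")

# Same search API as in the question: find() walks the whole tree.
span = soup.find('span', {'class': 'some-class-name'})
all_spans = soup.find_all('span', {'class': 'some-class-name'})

print(span.get_text() if span is not None else None)
print(len(all_spans))
```

This route trades some speed for convenience: parsing is done by lxml, but the tree you search is still a BeautifulSoup tree.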
