python - Need help with Python lxml syntax for parsing HTML

Question

I'm new to Python, and I need some help with the syntax for finding and iterating over HTML tags using lxml. Here is the use case I'm dealing with:

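The HTML file is fairly well formed (but not perfect). There are multiple tables on screen: one containing a set of search results, and one each for a header and a footer. Each result row contains a link to the search result detail.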

I need to find the middle table with the search result rows (this part I was able to figure out):

    self.mySearchTables = self.mySearchTree.findall(".//table")
    self.myResultRows = self.mySearchTables[1].findall(".//tr")

I need to find the links contained in this table (this is where I'm stuck):
```
    for searchRow in self.myResultRows:
        searchLink = patentRow.findall(".//a")
```
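It doesn't seem to actually find the link elements.
I need the plain text of the link. I imagine it would be something like searchLink.text, if I actually got the link elements in the first place.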

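Finally, in the actual API reference for lxml, I wasn't able to find information on the find and findall calls. I gleaned these from bits of code I found on Google. Am I missing something about how to effectively find and iterate over HTML tags using lxml?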

score 27 · Accepted Answer

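OK, first, regarding parsing the HTML: if you follow the recommendation of zweiterlinde and S.Lott, at least use the version of beautifulsoup included with lxml. That way you also get the benefit of a nice xpath or css selector interface.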

However, I personally prefer Ian Bicking's HTML parser included in lxml.

Secondly, .find() and .findall() come from lxml trying to be compatible with ElementTree, and those two methods are described in "XPath Support in ElementTree".

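Those two functions are fairly easy to use, but they support only a very limited subset of XPath. I recommend trying either the full lxml xpath() method or, if you are already familiar with CSS, the cssselect() method.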
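To make the difference concrete, here is a minimal sketch comparing the three calls. The sample markup and variable names are invented for illustration and are not taken from the question.

```python
from lxml.html import fromstring

# Hypothetical sample markup, invented for this sketch.
html = """
<table>
  <tr><td><a href="/detail/1">First result</a></td></tr>
  <tr><td><a href="/detail/2">Second result</a></td></tr>
</table>
"""
tree = fromstring(html)

# find() returns the first matching element, or None if nothing matches.
first_link = tree.find(".//a")

# findall() returns a list of every matching element.
all_links = tree.findall(".//a")

# xpath() accepts full XPath 1.0, including functions and attribute results.
second_text = tree.xpath("string(.//tr[2]//a)")
hrefs = tree.xpath(".//a[contains(@href, 'detail')]/@href")

print(first_link.text, len(all_links), second_text, hrefs)
```

Note that find() and findall() return elements (or a list of elements), while xpath() can also return strings and attribute values directly, depending on the expression.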

Here are some examples, with an HTML string parsed like this:

    from lxml.html import fromstring
    mySearchTree = fromstring(your_input_string)
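Since the question mentions that the HTML is not perfectly formed, a small sketch of how this parser copes with sloppy input may help. The markup below is a made-up example with unclosed `<td>` and `<tr>` tags; the exact tree the parser builds can vary between libxml2 versions, so inspect the output rather than relying on a particular shape.

```python
from lxml.html import fromstring, tostring

# Hypothetical, deliberately sloppy markup: the <td> and <tr> tags are never closed.
sloppy = "<table><tr><td><a href='/x'>Link</a><tr><td>Other</table>"
tree = fromstring(sloppy)

# The link is still reachable even though the input was not well formed.
link_texts = [a.text for a in tree.iter("a")]
print(link_texts)

# Serialize the repaired tree to see how the parser closed the open tags.
print(tostring(tree, pretty_print=True).decode())
```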

Using the CSS selector class, your program would look roughly like this:

    # Find all 'a' elements inside 'tr' table rows with css selector
    # (recent lxml releases need the separate cssselect package for this)
    for a in mySearchTree.cssselect('tr a'):
        print('found "%s" link to href "%s"' % (a.text, a.get('href')))

The equivalent using the xpath method would be:

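    # Find 'a' elements inside 'tr' table rows with xpath.
    # Note: './/tr/*/a' only matches links that are direct children of a cell;
    # use './/tr//a' to match links at any depth, like the CSS selector 'tr a'.
    for a in mySearchTree.xpath('.//tr/*/a'):
        print('found "%s" link to href "%s"' % (a.text, a.get('href')))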
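Applied to the use case in the question (a header table, a results table, and a footer table), the whole thing might look like the sketch below. The page content is invented for illustration. Two details in the question's snippet are worth noting: the loop variable is `searchRow` but the body calls `patentRow`, and findall() returns a list, so `.text` has to be read from each element rather than from the result itself.

```python
from lxml.html import fromstring

# Hypothetical page layout: header table, results table, footer table.
page = """
<html><body>
<table><tr><td>Header</td></tr></table>
<table>
  <tr><td><a href="/result/1">Result <b>one</b></a></td></tr>
  <tr><td><a href="/result/2">Result two</a></td></tr>
  <tr><td>No link in this row</td></tr>
</table>
<table><tr><td>Footer</td></tr></table>
</body></html>
"""
tree = fromstring(page)

# Same approach as in the question: the middle table holds the results.
tables = tree.findall(".//table")
result_rows = tables[1].findall(".//tr")

links = []
for row in result_rows:
    # findall() returns a list (possibly empty), so loop over it.
    for a in row.findall(".//a"):
        # text_content() also collects text inside nested tags such as <b>;
        # a.text alone would stop at the first child element.
        links.append((a.text_content().strip(), a.get("href")))

for text, href in links:
    print('found "%s" link to href "%s"' % (text, href))
```

Using text_content() rather than .text is a design choice for this sketch; if the links never contain nested markup, .text is enough.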

score 5 · Answer

Is there any reason you're not using Beautiful Soup for this project? It will make dealing with imperfectly formed documents much easier.
