python - Selecting an image tag attribute from HTML that has no class or id

Question

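I have an HTML page that I parse with Python and lxml. The problem is that I need to get an attribute value from an HTML image tag that has no class or id attribute. Like this: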

<table cellspacing="0" cellpadding="0" border="0">
<tbody><tr>
<td align="left" valign="top" style="padding: 0 10px 0 60px;">
<img src="/files/135.jpg" width="64" height="64">
</td>
<td align="left" valign="middle"><h1>Archer / Арчер</h1>
</td>
</tr>
</tbody></table>

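So my question is: can I write a jQuery-like expression to select the image tag from this HTML, or do I have to iterate over all img tags and take the src attribute of the one with a specific width and height?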

score 0 · Accepted Answer

You should try XPath, which lxml supports. You can use the FirePath add-on for Mozilla Firefox to experiment with XPath expressions. The end of your XPath expression could be (width > 64 ?) ........../img[@border="0"]
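
One way to read the hint above is as a structural path that ends in img, combined with a numeric test on width. The sketch below is only an illustration under assumptions: it anchors on the table's border="0" attribute (in the sample markup the img itself has no border attribute), and the threshold 32 is an arbitrary value chosen for the example, not something from the answer.

```python
import lxml.html

# Sample markup copied from the question.
SAMPLE_HTML = """
<table cellspacing="0" cellpadding="0" border="0">
<tbody><tr>
<td align="left" valign="top" style="padding: 0 10px 0 60px;">
<img src="/files/135.jpg" width="64" height="64">
</td>
<td align="left" valign="middle"><h1>Archer / Арчер</h1>
</td>
</tr>
</tbody></table>
"""

doc = lxml.html.fromstring(SAMPLE_HTML)

# Start from the borderless table, descend to an img directly inside a td,
# and keep it only if its width, converted to a number, exceeds 32.
# The threshold 32 is an assumption made for this example.
srcs = doc.xpath('//table[@border="0"]//td/img[@width > 32]/@src')
print(srcs)
```

In XPath 1.0 a comparison such as @width > 32 converts the attribute text to a number first, so the test is numeric rather than a string match.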

score 0 · Accepted Answer

This XPath query works for your sample data:

import lxml.html

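# 'your sample data' is a placeholder: pass the HTML from the question here.
root = lxml.html.fromstring('your sample data').getroottree()
# Select the src attribute of every img whose width and height are both "64".
root.xpath("//img[@width='64' and @height='64']/@src")
# ['/files/135.jpg']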
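
For comparison, here is a self-contained sketch that runs the same kind of query against the markup from the question and sets it beside the loop-over-all-img-tags approach the question mentions. The function names src_by_size_xpath and src_by_size_loop are hypothetical helpers introduced for illustration; they are not part of lxml.

```python
import lxml.html

# Sample markup copied from the question.
SAMPLE_HTML = """
<table cellspacing="0" cellpadding="0" border="0">
<tbody><tr>
<td align="left" valign="top" style="padding: 0 10px 0 60px;">
<img src="/files/135.jpg" width="64" height="64">
</td>
<td align="left" valign="middle"><h1>Archer / Арчер</h1>
</td>
</tr>
</tbody></table>
"""


def src_by_size_xpath(html, width, height):
    """Hypothetical helper: select src values with one XPath query."""
    doc = lxml.html.fromstring(html)
    # $w and $h are XPath variables; lxml fills them from keyword arguments.
    return doc.xpath(
        "//img[@width=$w and @height=$h]/@src", w=str(width), h=str(height)
    )


def src_by_size_loop(html, width, height):
    """Hypothetical helper: the manual iteration described in the question."""
    doc = lxml.html.fromstring(html)
    return [
        img.get("src")
        for img in doc.iter("img")
        if img.get("width") == str(width) and img.get("height") == str(height)
    ]


xpath_result = src_by_size_xpath(SAMPLE_HTML, 64, 64)
loop_result = src_by_size_loop(SAMPLE_HTML, 64, 64)
print(xpath_result, loop_result)
```

Both helpers compare the attributes as strings, so they should agree on this sample. The XPath version keeps the selection logic in a single expression, which is closer to the jQuery-style selector the question asks about.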
