python - Extracting images from an HTML page using Python

Question

Below is my code. It tries to get the src of the image tags in an HTML file.

import re
for text in open('site.html'):
  matches = re.findall(r'\ssrc="([^"]+)"', text)
  matches = ' '.join(matches)
print(matches)

The problem is that when I give it something like this:

<img src="asdfasdf">

it works, but when I feed it an entire HTML page it returns nothing. Why does it do that, and how can I fix it?

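site.html is just the HTML of an ordinary website. I want the script to ignore everything else and print only the image sources. If you want to see what site.html contains, open any basic HTML web page and copy its full source.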

score 10 · Accepted Answer

Why use a regular expression to parse HTML when something like BeautifulSoup makes this easy:

>>> from bs4 import BeautifulSoup as BS
>>> html = """This is some text
... <img src="asdasdasd">
... <i> More HTML <b> foo </b> bar </i>
... """
>>> soup = BS(html, 'html.parser')
>>> for imgtag in soup.find_all('img'):
...     print(imgtag['src'])
... 
asdasdasd
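
If installing BeautifulSoup is not an option, roughly the same idea can be sketched with the standard library's html.parser module. This is a swapped-in alternative, not what the answer above uses. The class name ImgSrcCollector and the extra script tag in the sample are made up for illustration. Unlike the regex, it only looks at img tags, so the src of the script tag is ignored.

```python
from html.parser import HTMLParser


class ImgSrcCollector(HTMLParser):
    """Hypothetical helper: collects the src attribute of every <img> tag."""

    def __init__(self):
        super().__init__()
        self.sources = []

    def handle_starttag(self, tag, attrs):
        # Tag names arrive lowercased; attrs is a list of (name, value) pairs.
        if tag == 'img':
            for name, value in attrs:
                if name == 'src' and value is not None:
                    self.sources.append(value)


# Sample markup (assumed for this sketch); the script tag also has a src.
html = """This is some text
<img src="asdasdasd">
<script src="app.js"></script>
<i> More HTML <b> foo </b> bar </i>
"""

collector = ImgSrcCollector()
collector.feed(html)
print(collector.sources)
```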

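Your code doesn't work because text is a single line of the file, so each iteration only finds the matches for that one line. That could still work, but consider what happens when the last line has no image tag: matches is an empty list, and join turns it into ''. You are overwriting the matches variable on every line.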
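
The overwrite described above can be reproduced with a small sketch. The three-line sample is a made-up stand-in for site.html, held in memory with io.StringIO so that no file is needed. The first loop mirrors the question's code; the second keeps the line-by-line loop but accumulates the results instead of replacing them.

```python
import io
import re

PATTERN = re.compile(r'\ssrc="([^"]+)"')

# Hypothetical stand-in for site.html: the image is on line 2, not the last line.
sample = io.StringIO('<html>\n<img src="a.png">\n</html>\n')

# Question's approach: matches is replaced on every iteration,
# so only the result for the final line survives.
for text in sample:
    matches = ' '.join(PATTERN.findall(text))
overwritten = matches

# Accumulating approach: extend one list with the matches from every line.
sample.seek(0)
collected = []
for text in sample:
    collected.extend(PATTERN.findall(text))
accumulated = ' '.join(collected)

print(repr(overwritten))   # empty, because '</html>' has no src attribute
print(repr(accumulated))
```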

You want to call findall on the whole HTML:

import re
with open('site.html') as html:
    content = html.read()
    matches = re.findall(r'\ssrc="([^"]+)"', content)
    matches = ' '.join(matches)

print(matches)

Using the with statement here is more Pythonic. It also means you don't have to call file.close() afterwards, because the with statement takes care of that.
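
A quick way to see the with statement closing the file is to check the file object's closed attribute inside and after the block. This is only a sketch: the temporary file and its one-line content are assumptions standing in for site.html.

```python
import os
import tempfile

# Create a throwaway file standing in for site.html (hypothetical content).
with tempfile.NamedTemporaryFile('w', suffix='.html', delete=False) as tmp:
    tmp.write('<img src="a.png">')
    path = tmp.name

with open(path) as html:
    content = html.read()
    closed_inside = html.closed   # still open while inside the block

closed_after = html.closed        # closed automatically on leaving the block

os.remove(path)                   # clean up the temporary file
print(closed_inside, closed_after)
```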
