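I'm loading a web page into an iframe, and I want to make sure all of the related media is available. I'm currently using requests to download the page and then doing some find/replace, but that doesn't cover everything. Is there a way in Python to get a list of all the script, CSS, and image requests that are made when a page is loaded in a browser?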
Use BeautifulSoup4 to get all the <img>, <link>, and <script> tags, then pull the corresponding attributes.
from bs4 import BeautifulSoup
import requests
resp = requests.get("http://www.yahoo.com")
# Name the parser explicitly so the result does not depend on which parsers are installed
soup = BeautifulSoup(resp.text, "html.parser")
# Pull the linked images (note: this also grabs base64-encoded data: URIs)
images = [img['src'] for img in soup.find_all('img', src=True)]
# Requiring src ensures that we don't grab the embedded (inline) scripts
scripts = [script['src'] for script in soup.find_all('script', src=True)]
# favicon.ico and CSS
links = [link['href'] for link in soup.find_all('link', href=True)]
Sample output:
In [30]: images = [img['src'] for img in soup.find_all('img', src=True)]
In [31]: images[:5]
Out[31]:
['http://l.yimg.com/dh/ap/default/130925/My_Yahoo_Defatul_HP_ad_300x250.jpeg',
'http://l.yimg.com/os/mit/media/m/base/images/transparent-95031.png',
'http://l.yimg.com/os/mit/media/m/base/images/transparent-95031.png',
'http://l.yimg.com/os/mit/media/m/base/images/transparent-95031.png',
'http://l.yimg.com/os/mit/media/m/base/images/transparent-95031.png']
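The attribute values pulled above can be relative (for example /css/main.css or //cdn.example.com/a.png), so they usually need to be resolved against the page URL before they can be fetched. Below is a minimal sketch of that step. It swaps in the standard library's html.parser instead of BeautifulSoup so that it runs without extra packages, and the names URL_ATTRS, ResourceCollector, and collect_resources are hypothetical helpers, not part of any library.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical mapping: tag name -> attribute that holds the resource URL.
URL_ATTRS = {"img": "src", "script": "src", "link": "href"}


class ResourceCollector(HTMLParser):
    """Collect absolute URLs of images, external scripts, and <link> targets."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.found = {tag: [] for tag in URL_ATTRS}

    def handle_starttag(self, tag, attrs):
        wanted = URL_ATTRS.get(tag)
        if wanted is None:
            return
        for name, value in attrs:
            # Skip tags without the attribute (e.g. inline <script> blocks).
            if name == wanted and value:
                # urljoin resolves relative and protocol-relative URLs.
                self.found[tag].append(urljoin(self.base_url, value))


def collect_resources(html, base_url):
    """Return a dict mapping 'img', 'script', 'link' to lists of absolute URLs."""
    parser = ResourceCollector(base_url)
    parser.feed(html)
    parser.close()
    return parser.found
```

Calling collect_resources(resp.text, resp.url) on a downloaded page would give the absolute URLs to fetch. Like the BeautifulSoup approach, this only sees what is in the static HTML: resources referenced from inside CSS files or requested by JavaScript at runtime will not appear, so capturing those would likely require driving a real browser.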
answered 2013-10-22T18:48:28.227