python - Download only HTML pages with urllib2

Question

I am trying to crawl the web using urllib2 and BeautifulSoup, but my code runs out of memory on some links, such as this one:

http://downloads.graboidvideo.com/download_filter.php?file=GraboidVideoSetup.pkg&platform=Mac

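This is a video download link. When I call urllib2.urlopen() on it, the video gets downloaded, which is not what I want. Is there a way to download only the HTML of a URL? If the URL points to a video or some other file, I basically want to skip it, but I don't know how to do that.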

My code is as follows:

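import urllib2

# The URL must be a quoted string, and the same name must be passed to Request.
toy_url = 'http://downloads.graboidvideo.com/download_filter.php?file=GraboidVideoSetup.pkg&platform=Mac'
headers = {'User-Agent': 'crawltaosof'}
req = urllib2.Request(toy_url, None, headers)
# read() pulls the whole response body into memory, whatever its type.
page = urllib2.urlopen(req, timeout=0.51).read()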

score 5 · Accepted Answer

Consider checking the response headers before calling the read() method. Here is an example.

>>> import urllib2
>>>
>>> request = urllib2.Request('http://downloads.graboidvideo.com/download_filter.php?file=GraboidVideoSetup.pkg&platform=Mac')
>>> response = urllib2.urlopen(request)
>>>
>>> print response.info().getheader('Content-Type')
application/octet-stream
>>>
>>>
>>> request = urllib2.Request('http://www.yahoo.com')
>>> response = urllib2.urlopen(request)
>>>
>>> print response.info().getheader('Content-Type')
text/html;charset=utf-8

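Ultimately, you will want to test the Content-Type response header and make sure it is of type text/html before running the URL through your web crawler. Note that the header value can carry parameters such as a charset, so compare only the part before the semicolon. If you want to learn about other types, see the Wikipedia article on Internet media types.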
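As a rough sketch of how that check could be wired into a crawler: the helper names is_html and fetch_html below are hypothetical, the default timeout of 10 seconds is an arbitrary choice, and the Python 3 import fallback is an assumption added only so the snippet runs on either interpreter. It relies on urlopen() returning once the headers have arrived, so the body should only be transferred in full when read() is called.

```python
try:
    import urllib2  # Python 2, as used in the question
except ImportError:
    # Assumption: on Python 3, urllib.request offers the same Request/urlopen names.
    import urllib.request as urllib2


def is_html(content_type):
    """Return True if a Content-Type header value denotes text/html."""
    if not content_type:
        return False
    # Drop parameters such as ";charset=utf-8" and normalise the case.
    media_type = content_type.split(';', 1)[0].strip().lower()
    return media_type == 'text/html'


def fetch_html(url, timeout=10):
    """Return the page body for HTML responses, or None for anything else."""
    request = urllib2.Request(url, None, {'User-Agent': 'crawltaosof'})
    response = urllib2.urlopen(request, timeout=timeout)
    try:
        # Inspect the header first; skip the URL without calling read().
        if not is_html(response.info().get('Content-Type')):
            return None
        return response.read()
    finally:
        response.close()
```

In the crawl loop, a None result from fetch_html(url) would mean "not HTML, skip this link", and anything else can be handed to BeautifulSoup. A server can still send a wrong Content-Type, so this is a filter rather than a guarantee.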
