python - I'm downloading a file with Python urllib2. How do I check the file size?

Question

And if it's large... stop the download? I don't want to download files larger than 12MB.

import random
import urllib2

request = urllib2.Request(ep_url)
request.add_header('User-Agent', random.choice(agents))
thefile = urllib2.urlopen(request).read()

score 20 · Accepted Answer

There's no need to drop down to httplib as bobince did. You can do all of this with urllib2 directly:

>>> import urllib2
>>> f = urllib2.urlopen("http://dalkescientific.com")
>>> f.headers.items()
[('content-length', '7535'), ('accept-ranges', 'bytes'), ('server', 'Apache/2.2.14'),
 ('last-modified', 'Sun, 09 Mar 2008 00:27:43 GMT'), ('connection', 'close'),
 ('etag', '"19fa87-1d6f-447f627da7dc0"'), ('date', 'Wed, 28 Oct 2009 19:59:10 GMT'),
 ('content-type', 'text/html')]
>>> f.headers["Content-Length"]
'7535'
>>>

If you use httplib, you may have to implement redirect handling, proxy support, and the other conveniences that urllib2 gives you for free.
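
To tie this back to the 12MB limit in the question, here is a minimal sketch of how the header check might be combined with a bounded read. The names `read_capped` and `FileTooLarge` are hypothetical and not part of urllib2; the code only assumes the response exposes `.headers` and `.read(n)`, as the object returned by `urllib2.urlopen()` does.

```python
class FileTooLarge(Exception):
    """Hypothetical exception for responses over the size limit."""


def read_capped(response, max_bytes=12 * 1024 * 1024):
    """Return the response body, or raise FileTooLarge if it is too big.

    `response` is assumed to behave like the object returned by
    urllib2.urlopen(): it has a .headers mapping and a .read(n) method.
    """
    # Content-Length is optional, so only reject up front when the
    # server actually declares a numeric size above the limit.
    declared = response.headers.get("Content-Length")
    if declared is not None and declared.strip().isdigit():
        if int(declared) > max_bytes:
            raise FileTooLarge("server declared %s bytes" % declared.strip())

    # The header may be absent or wrong, so still bound the read itself:
    # asking for one extra byte reveals whether the body is over the limit.
    body = response.read(max_bytes + 1)
    if len(body) > max_bytes:
        raise FileTooLarge("body exceeds %d bytes" % max_bytes)
    return body
```

It would be called as `thefile = read_capped(urllib2.urlopen(request))`. When the declared size is over the limit, nothing is read from the body at all.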

score 7 · Accepted Answer

You could say:

maxlength= 12*1024*1024
thefile= urllib2.urlopen(request).read(maxlength+1)
if len(thefile)==maxlength+1:
    raise ThrowToysOutOfPramException()

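But of course you've still read 12MB of unwanted data. If you want to minimise the risk of that happening, you can check the HTTP Content-Length header, if it is present (it might not be). But to do that, you need to drop down to httplib instead of the more general urllib2.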

import httplib
import urlparse

u= urlparse.urlparse(ep_url)
cn= httplib.HTTPConnection(u.netloc)
# ua is the User-Agent string to send, e.g. random.choice(agents)
cn.request('GET', u.path, headers= {'User-Agent': ua})
r= cn.getresponse()

try:
    l= int(r.getheader('Content-Length', '0'))
except ValueError:
    l= 0
if l>maxlength:
    raise IAmCrossException()

thefile= r.read(maxlength+1)
if len(thefile)==maxlength+1:
    raise IAmStillCrossException()

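If you like, you can also check the length before asking for the file. This is basically the same as above, except using the method 'HEAD' instead of 'GET'.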
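
For comparison, a HEAD request can also be sketched without leaving urllib2, by overriding `get_method()` on a `Request` subclass. The names `HeadRequest` and `head_content_length` are hypothetical, and the `opener` parameter is an assumption added only so the function can be tried without network access. The import fallback is there so the same sketch should also run on Python 3.

```python
try:
    from urllib2 import Request, urlopen  # Python 2, as used in this thread
except ImportError:
    from urllib.request import Request, urlopen  # Python 3 equivalent


class HeadRequest(Request):
    """A Request that asks only for the headers, not the body."""

    def get_method(self):
        return "HEAD"


def head_content_length(url, opener=urlopen):
    """Return the raw Content-Length header for `url`, or None if absent.

    `opener` defaults to urlopen and is called with a HeadRequest.
    """
    response = opener(HeadRequest(url))
    try:
        # The value is a string such as '7535'; servers may omit it.
        return response.headers.get("Content-Length")
    finally:
        response.close()
```

A server is not obliged to send Content-Length in reply to HEAD, so a `None` result means "unknown", not "small". A bounded read on the later GET is still worth keeping.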

score 1 · Accepted Answer

This will work if the Content-Length header is set:

import urllib2
req = urllib2.urlopen("http://example.com/file.zip")
total_size = int(req.info().getheader('Content-Length'))
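
When the header is not set, `getheader` returns None and the `int()` call above raises a TypeError. A small helper along these lines might make the missing case explicit. The name `declared_size` is hypothetical, and it assumes only that the headers object has a dict-style `.get()` method, which the object returned by `req.info()` does.

```python
def declared_size(headers):
    """Return Content-Length as an int, or None if missing or malformed.

    `headers` is assumed to be the object returned by response.info(),
    or anything else with a dict-style .get() method.
    """
    raw = headers.get("Content-Length")
    if raw is None:
        return None  # header not sent at all
    try:
        size = int(raw)
    except ValueError:
        return None  # header present but not a number
    # A negative length is meaningless, so treat it as unknown too.
    return size if size >= 0 else None
```

With this, `total_size = declared_size(req.info())` yields None instead of an exception when the server gives no usable length.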

score 1 · Accepted Answer

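You can check the content length with a HEAD request first, but be aware that this header is not guaranteed to be set - see How do you send a HEAD HTTP request in Python 2?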
