python - How to catch 404 error in urllib.urlretrieve

Question

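Background: I am using urllib.urlretrieve, as opposed to any other function in the urllib* modules, because of its hook function support (see reporthook below), which I use to display a textual progress bar. This is Python >= 2.6.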

>>> urllib.urlretrieve(url[, filename[, reporthook[, data]]])

However, urlretrieve is so dumb that it leaves no way to detect the status of the HTTP request (e.g., was it a 404 or a 200?).

>>> fn, h = urllib.urlretrieve('http://google.com/foo/bar')
>>> h.items() 
[('date', 'Thu, 20 Aug 2009 20:07:40 GMT'),
 ('expires', '-1'),
 ('content-type', 'text/html; charset=ISO-8859-1'),
 ('server', 'gws'),
 ('cache-control', 'private, max-age=0')]
>>> h.status
''
>>>

What is the best known way to download a remote HTTP file with hook-like support (to show a progress bar) and decent HTTP error handling?

score 28 · Accepted Answer

Check out urllib.urlretrieve's complete code:

def urlretrieve(url, filename=None, reporthook=None, data=None):
  global _urlopener
  if not _urlopener:
    _urlopener = FancyURLopener()
  return _urlopener.retrieve(url, filename, reporthook, data)

In other words, you can use urllib.FancyURLopener (it is part of the public urllib API). You can override http_error_default to detect 404s:

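import urllib

class MyURLopener(urllib.FancyURLopener):
  def http_error_default(self, url, fp, errcode, errmsg, headers):
    # handle errors the way you'd like to; raising here is one example
    fp.close()
    raise IOError('http error', errcode, errmsg, headers)

# url and my_report_hook are assumed to be defined by the caller
fn, h = MyURLopener().retrieve(url, reporthook=my_report_hook)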
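The my_report_hook passed to retrieve above is not defined in the answer. As a minimal sketch, assuming the hook is called as hook(block_count, block_size, total_size) with total_size set to -1 when the server reports no size, a text progress bar could look like this. The names format_progress and the width parameter are illustrative, not part of urllib.

```python
import sys


def format_progress(block_num, block_size, total_size, width=40):
    """Return a one-line text progress bar for a urlretrieve-style hook."""
    if total_size <= 0:
        # The server did not report a size; show bytes transferred only.
        return "%d bytes" % (block_num * block_size)
    # The last block is usually short, so clamp to the reported total.
    done = min(block_num * block_size, total_size)
    filled = width * done // total_size
    percent = 100 * done // total_size
    return "[%s%s] %3d%%" % ("#" * filled, "." * (width - filled), percent)


def my_report_hook(block_num, block_size, total_size):
    # "\r" returns to the start of the line so the bar is redrawn in place.
    sys.stdout.write("\r" + format_progress(block_num, block_size, total_size))
    sys.stdout.flush()
```

Keeping the formatting in its own function means the bar can be checked without performing a download; my_report_hook only handles the terminal output.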

score 15 · Answer

You should use:

import urllib2

try:
    resp = urllib2.urlopen("http://www.google.com/this-gives-a-404/")
except urllib2.URLError, e:
    if not hasattr(e, "code"):
        raise
    resp = e

print "Gave", resp.code, resp.msg
print "=" * 80
print resp.read(80)

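Edit: The rationale here is that unless you expect the exceptional status, its occurrence is an exception, one you probably did not even think about. So rather than letting your code keep running after an unsuccessful request, the default behavior is, quite sensibly, to stop its execution.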
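This answer handles the error but drops the progress hook the question asks for. One way to get both, offered as a sketch rather than an established recipe, is to read the response in blocks yourself and call the hook with the same arguments urlretrieve would use. The helper names copy_with_progress and download are hypothetical, and the import fallback is only there so the sketch also runs where urlopen lives in urllib.request.

```python
try:
    from urllib2 import urlopen          # Python 2, as in the answer above
except ImportError:
    from urllib.request import urlopen   # Python 3 location of the same function


def copy_with_progress(src, dst, total_size=-1, reporthook=None, block_size=8192):
    """Copy src to dst in blocks, calling reporthook the way urlretrieve does."""
    block_num = 0
    if reporthook:
        reporthook(block_num, block_size, total_size)
    while True:
        block = src.read(block_size)
        if not block:
            break
        dst.write(block)
        block_num += 1
        if reporthook:
            reporthook(block_num, block_size, total_size)
    return block_num


def download(url, filename, reporthook=None):
    """Hypothetical helper: fetch url into filename, reporting progress."""
    # urlopen raises HTTPError (a URLError subclass) for a 404,
    # so the output file is never created for a failed request.
    resp = urlopen(url)
    try:
        total_size = int(resp.info().get("Content-Length", -1))
        with open(filename, "wb") as out:
            copy_with_progress(resp, out, total_size, reporthook)
    finally:
        resp.close()
    return filename
```

A caller would wrap download(url, filename, reporthook=...) in the same try/except as above and inspect e.code to tell a 404 from other failures.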

score 2 · Answer

The retrieve method of URL opener objects supports a reporthook and raises an exception on a 404.

http://docs.python.org/library/urllib.html#url-opener-objects
