Earlier today I was able to pull data from Google Patents using the code below.

import urllib2

url = 'http://www.google.com/search?tbo=p&q=ininventor:"John-Mudd"&hl=en&tbm=pts&source=lnt&tbs=ptso:us'
req = urllib2.Request(url, headers={'User-Agent' : "foobar"})

response = urllib2.urlopen(req)

Now, when I run it, I get the following 503 error. I had only looped over the code above about 30 times (I am trying to get all the patents owned by 30 people).

HTTPError                                 Traceback (most recent call last)
<ipython-input-4-01f83e2c218f> in <module>()
----> 1 response = urllib2.urlopen(req)

C:\Python27\lib\urllib2.pyc in urlopen(url, data, timeout)
    124     if _opener is None:
    125         _opener = build_opener()
--> 126     return _opener.open(url, data, timeout)
    127 
    128 def install_opener(opener):

C:\Python27\lib\urllib2.pyc in open(self, fullurl, data, timeout)
    404         for processor in self.process_response.get(protocol, []):
    405             meth = getattr(processor, meth_name)
--> 406             response = meth(req, response)
    407 
    408         return response

C:\Python27\lib\urllib2.pyc in http_response(self, request, response)
    517         if not (200 <= code < 300):
    518             response = self.parent.error(
--> 519                 'http', request, response, code, msg, hdrs)
    520 
    521         return response

C:\Python27\lib\urllib2.pyc in error(self, proto, *args)
    436             http_err = 0
    437         args = (dict, proto, meth_name) + args
--> 438         result = self._call_chain(*args)
    439         if result:
    440             return result

C:\Python27\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    376             func = getattr(handler, meth_name)
    377 
--> 378             result = func(*args)
    379             if result is not None:
    380                 return result

C:\Python27\lib\urllib2.pyc in http_error_302(self, req, fp, code, msg, headers)
    623         fp.close()
    624 
--> 625         return self.parent.open(new, timeout=req.timeout)
    626 
    627     http_error_301 = http_error_303 = http_error_307 = http_error_302

C:\Python27\lib\urllib2.pyc in open(self, fullurl, data, timeout)
    404         for processor in self.process_response.get(protocol, []):
    405             meth = getattr(processor, meth_name)
--> 406             response = meth(req, response)
    407 
    408         return response

C:\Python27\lib\urllib2.pyc in http_response(self, request, response)
    517         if not (200 <= code < 300):
    518             response = self.parent.error(
--> 519                 'http', request, response, code, msg, hdrs)
    520 
    521         return response

C:\Python27\lib\urllib2.pyc in error(self, proto, *args)
    442         if http_err:
    443             args = (dict, 'default', 'http_error_default') + orig_args
--> 444             return self._call_chain(*args)
    445 
    446 # XXX probably also want an abstract factory that knows when it makes

C:\Python27\lib\urllib2.pyc in _call_chain(self, chain, kind, meth_name, *args)
    376             func = getattr(handler, meth_name)
    377 
--> 378             result = func(*args)
    379             if result is not None:
    380                 return result

C:\Python27\lib\urllib2.pyc in http_error_default(self, req, fp, code, msg, hdrs)
    525 class HTTPDefaultErrorHandler(BaseHandler):
    526     def http_error_default(self, req, fp, code, msg, hdrs):
--> 527         raise HTTPError(req.get_full_url(), code, msg, hdrs, fp)
    528 
    529 class HTTPRedirectHandler(BaseHandler):

HTTPError: HTTP Error 503: Service Unavailable

2 Answers


Sadly, Google's TOS prohibits automated queries. It has almost certainly detected that you are "up to no good".

Source: https://support.google.com/websearch/answer/86640?hl=en

answered 2013-07-12T07:32:54.897

A shot in the dark:

Did you check whether the response carries a Retry-After header? That is quite possible with a 503.

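From RFC 2616:

14.37 Retry-After

The Retry-After response-header field can be used with a 503 (Service Unavailable) response to indicate how long the service is expected to be unavailable to the requesting client. This field MAY also be used with any 3xx (Redirection) response to indicate the minimum time the user-agent is asked to wait before issuing the redirected request. The value of this field can be either an HTTP-date or an integer number of seconds (in decimal) after the time of the response.

Retry-After = "Retry-After" ":" ( HTTP-date | delta-seconds )

Two examples of its use are

Retry-After: Fri, 31 Dec 1999 23:59:59 GMT
Retry-After: 120

In the latter example, the delay is 2 minutes.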
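To act on that header, both forms the RFC allows (delta-seconds and HTTP-date) have to be turned into a delay. The following is only a minimal sketch of that conversion; the helper name `retry_after_seconds` and its `now` parameter are made up for illustration, and it uses only the standard library's `email.utils` date parsing. Whether Google actually sends a Retry-After header with its 503 is not something this thread confirms.

```python
import time
from email.utils import parsedate_tz, mktime_tz


def retry_after_seconds(value, now=None):
    """Convert a Retry-After header value into a delay in seconds.

    `value` is the raw header string, or None if the header is absent.
    `now` is the current time as a Unix timestamp (hypothetical parameter,
    added so the function can be checked with a fixed clock).
    Returns None when the header is missing or cannot be parsed.
    """
    if value is None:
        return None
    value = value.strip()

    # Form 1: delta-seconds, e.g. "120"
    if value.isdigit():
        return int(value)

    # Form 2: HTTP-date, e.g. "Fri, 31 Dec 1999 23:59:59 GMT"
    parsed = parsedate_tz(value)
    if parsed is None:
        return None
    if now is None:
        now = time.time()
    # A date in the past means no further waiting is requested.
    return max(0, int(mktime_tz(parsed) - now))
```

In the question's code, the header would be read from the caught exception, for example `except urllib2.HTTPError as e:` followed by `retry_after_seconds(e.hdrs.get('Retry-After'))`, and the script would sleep for the returned number of seconds (if any) before retrying. Waiting as requested does not change the terms-of-service point made in the other answer.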

answered 2013-03-19T18:22:25.007