I have a small crawler that extracts the content of a simple web page.

    from urllib.request import urlopen

    def url2dict(url):
        '''
        DOCSTRING: converts two-column data into a dictionary with first column as a key.
        INPUT: URL address as a string
        OUTPUT: dictionary with one key and one value
        '''
        # Open the URL and read the raw response body as bytes.
        with urlopen(url) as page:
            page_raw = page.read()
        ...

This function requests the given url from the server. The problem is that the request fails with a 504 error:

      File "C:\Python38\lib\urllib\request.py", line 222, in urlopen
        return opener.open(url, data, timeout)
      File "C:\Python38\lib\urllib\request.py", line 531, in open
        response = meth(req, response)
      File "C:\Python38\lib\urllib\request.py", line 640, in http_response
        response = self.parent.error(
      File "C:\Python38\lib\urllib\request.py", line 569, in error
        return self._call_chain(*args)
      File "C:\Python38\lib\urllib\request.py", line 502, in _call_chain
        result = func(*args)
      File "C:\Python38\lib\urllib\request.py", line 649, in http_error_default
        raise HTTPError(req.full_url, code, msg, hdrs, fp)
    urllib.error.HTTPError: HTTP Error 504: Gateway Time-out

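My problem is that I cannot find the default value of the urlopen timeout.
According to https://bugs.python.org/issue18417 there is no timeout by default (timeout = None), at least for Python 3.4:
"OK, I reviewed the issue enough to remember: if socket.setdefaulttimeout is never called, then the default timeout is None (no timeout)."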
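To check what the quoted comment describes on a given interpreter, the process-wide socket default can be read back directly. This is only a minimal sketch: `describe_default_timeout` is a hypothetical helper name, not part of the standard library, and the only real calls it relies on are `socket.getdefaulttimeout()` and `socket.setdefaulttimeout()`.

```python
import socket


def describe_default_timeout():
    """Hypothetical helper: describe the process-wide default socket timeout."""
    default = socket.getdefaulttimeout()
    if default is None:
        # None means new sockets block indefinitely, i.e. no timeout.
        return "no timeout (blocking sockets)"
    return f"{default} seconds"


print(describe_default_timeout())
```

In an interpreter where nothing has called `socket.setdefaulttimeout`, this should report that no timeout is set, which would match the statement in the bug report.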
What is the current state in 3.8?
If no timeout is set, why do I get this 504 error?
More details:
One of the traceback frames shows:

      File "C:\Python38\lib\urllib\request.py", line 222, in urlopen
        return opener.open(url, data, timeout)

I opened that file and read:

    def urlopen(url, data=None, timeout=socket._GLOBAL_DEFAULT_TIMEOUT,
                *, cafile=None, capath=None, cadefault=False, context=None):
        '''Open the URL url, which can be either a string or a Request object.

        *data* must be an object specifying additional data to be sent to
        the server, or None if no such data is needed.  See Request for
        details.

        urllib.request module uses HTTP/1.1 and includes a "Connection:close"
        header in its HTTP requests.

        The optional *timeout* parameter specifies a timeout in seconds for
        blocking operations like the connection attempt (if not specified, the
        global default timeout setting will be used). This only works for HTTP,
        HTTPS and FTP connections.

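So does "(if not specified, the global default timeout setting will be used)" mean that if I define a global variable named timeout, it will be used as the timeout duration?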
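For reference, here is a minimal sketch of how a timeout can reach urlopen, as far as the docstring and the socket module suggest. The helper name `fetch` and its `timeout` parameter are hypothetical, introduced only for illustration. The sketch assumes that the "global default" is the value returned by `socket.getdefaulttimeout()`, and that a module-level variable named timeout is never consulted. It also separates an HTTP error status returned by the server (such as 504) from a timeout raised on the client side.

```python
import socket
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url, timeout=None):
    """Hypothetical helper: read a URL, optionally with a per-call timeout in seconds."""
    try:
        if timeout is None:
            # No explicit value: urlopen uses socket.getdefaulttimeout(),
            # which is None (no timeout) unless socket.setdefaulttimeout was called.
            response = urlopen(url)
        else:
            # Explicit per-call timeout, independent of any global variable.
            response = urlopen(url, timeout=timeout)
        with response as page:
            return page.read()
    except HTTPError as err:
        # The server side replied with an error status, for example 504.
        print(f"HTTP error {err.code}: {err.reason}")
    except URLError as err:
        # Connection-level failure, which can wrap a client-side connect timeout.
        print(f"URL error: {err.reason}")
    except socket.timeout:
        # Client-side timeout while reading the response body.
        print("client-side timeout while reading")
    return None


# Example call (placeholder URL, not the one from the question):
# fetch("https://example.com/", timeout=10)
```

HTTPError is caught first because it is a subclass of URLError. A 504 arriving as an HTTPError means a response came back from the server side, so it appears to be a different condition from the client-side timeout that the `timeout` parameter controls.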