When I call the following function to process a long list of URLs, all on the same site (i.e. http://foo.bar.com/url1, http://foo.bar.com/url2, and so on):
import time
import grequests

def processUrls(block=2500, write=100000, timeout=0.5):
    urls = ... ## generate long array of URLs
    chunks = [urls[i:i+block] for i in xrange(0, len(urls), block)] ## chunk 'em

    def callback(response, *args, **kwargs):
        txt = response.text
        ## do something with txt
        response.close()

    for i, chunk in enumerate(chunks):
        rs = [grequests.get(url, callback=callback) for url in chunk]
        grequests.map(rs, stream=False, size=block / 10)
        time.sleep(timeout)
        ## do stuff
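For reference, here is a minimal standalone sketch of the chunking and pool-size arithmetic the function relies on. The 6000-entry URL list and the helper name `chunk_urls` are made up for illustration; `//` is used so that the result matches what `block / 10` gives under Python 2.

```python
def chunk_urls(urls, block):
    """Split urls into consecutive lists of at most `block` items."""
    return [urls[i:i + block] for i in range(0, len(urls), block)]


# Hypothetical URL list, only to illustrate the arithmetic.
urls = ["http://foo.bar.com/url%d" % n for n in range(1, 6001)]
block = 2500

chunks = chunk_urls(urls, block)
pool_size = block // 10  # same value as `block / 10` under Python 2

print([len(c) for c in chunks])  # [2500, 2500, 1000]
print(pool_size)                 # 250
```

So with the default arguments, up to 250 requests to the same host are in flight at once within each chunk.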
I get a bunch of errors like this:
  File "/.../python2.7/site-packages/gevent/greenlet.py", line 327, in run
    result = self._run(*self.args, **self.kwargs)
  File "/.../python2.7/site-packages/grequests.py", line 71, in send
    self.url, **merged_kwargs)
  File "/.../python2.7/site-packages/requests/sessions.py", line 465, in request
    resp = self.send(prep, **send_kwargs)
  File "/.../python2.7/site-packages/requests/sessions.py", line 573, in send
    r = adapter.send(request, **kwargs)
  File "/.../python2.7/site-packages/requests/adapters.py", line 415, in send
    raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(97, 'Address family not supported by protocol'))
<Greenlet at 0x7f8ce2c0ec30: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x7f8ce31e2890>>(stream=False)> failed with ConnectionError
The number of error messages is far smaller than the number of URLs.
What could be causing these errors? I am running this on RedHat 6.6.
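Update: I collected all the URLs that gave me errors from the full data set I have been using. They all look fine (well-formed, etc.), and when I paste one of them into a browser I get a meaningful result with no error message. I then reran the test on only a subset of the data. Again some errors occurred, and I collected the list of failing URLs for the subset. It turns out that none of the failing URLs from the subset appear in the list of failing URLs from the full set. This suggests that the error is not really URL-specific but some kind of hiccup, either on my side or on the server's. Does this ring a bell?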
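The comparison described in the update can be sketched roughly as follows. This is only an illustration of the set logic: the function name `compare_failures` and all the URLs in the sample data are made up, and it assumes the failing URLs from each run have already been gathered into plain lists.

```python
def compare_failures(full_run_failures, subset_run_failures, subset_urls):
    """Compare the failing URLs of two runs using set operations."""
    full_failed = set(full_run_failures)
    subset_failed = set(subset_run_failures)
    subset = set(subset_urls)
    return {
        # URLs that failed in both runs (would point to URL-specific trouble).
        "failed_in_both": full_failed & subset_failed,
        # URLs that failed in the subset run but not in the full run.
        "failed_only_in_subset_run": subset_failed - full_failed,
        # URLs in the subset that failed in the full run but succeeded on rerun.
        "failed_in_full_but_ok_in_subset": (full_failed & subset) - subset_failed,
    }


# Made-up sample data, not taken from the real runs.
subset_urls = ["http://foo.bar.com/url%d" % n for n in range(1, 6)]
full_failures = ["http://foo.bar.com/url2", "http://foo.bar.com/url9"]
subset_failures = ["http://foo.bar.com/url4"]

result = compare_failures(full_failures, subset_failures, subset_urls)
for name in sorted(result):
    print(name, sorted(result[name]))
```

An empty `failed_in_both` set, as in my runs, is what makes me think the failures are transient rather than tied to particular URLs.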