When I call the following function to process a long list of URLs, all pointing at the same site (e.g. http://foo.bar.com/url1, http://foo.bar.com/url2):

import time
import grequests

def processUrls(block=2500, write=100000, timeout=0.5):
    urls = ...  ## generate long array of URLs
    chunks = [urls[i:i+block] for i in xrange(0, len(urls), block)] ## chunk 'em

    def callback(response, *args, **kwargs):
        txt = response.text
        ## do something with txt
        response.close()

    for i, chunk in enumerate(chunks):
        rs = [grequests.get(url, callback=callback) for url in chunk]
        grequests.map(rs, stream=False, size=block / 10)
        time.sleep(timeout)
        ## do stuff

I get a bunch of errors like this:

File "/.../python2.7/site-packages/gevent/greenlet.py", line 327, in run
result = self._run(*self.args, **self.kwargs)
File "/.../python2.7/site-packages/grequests.py", line 71, in send
self.url, **merged_kwargs)
File "/.../python2.7/site-packages/requests/sessions.py", line 465, in request
resp = self.send(prep, **send_kwargs)
File "/.../python2.7/site-packages/requests/sessions.py", line 573, in send
r = adapter.send(request, **kwargs)
File "/.../python2.7/site-packages/requests/adapters.py", line 415, in send
raise ConnectionError(err, request=request)
ConnectionError: ('Connection aborted.', error(97, 'Address family not supported by protocol'))
<Greenlet at 0x7f8ce2c0ec30: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x7f8ce31e2890>>(stream=False)> failed with ConnectionError

The number of error messages is much smaller than the number of URLs.

What could be causing these errors? I am running this on RedHat 6.6.

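Update: I collected every URL that produced an error from the full dataset I have been using. They all look fine (well-formed, etc.), and when I paste one of them into a browser I get a meaningful result and no error message. I then reran the test on only a subset of the data. Again some errors occurred, and I collected the list of failing URLs for the subset. It turns out that none of the failing URLs from the subset appear in the list of failing URLs from the full set. This suggests that the error is not really URL-specific, but rather some kind of hiccup, either on my end or on the other end. Does this ring a bell?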
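
For reference, the comparison described in the update can be sketched roughly as follows. This is only a minimal illustration: the helper name `overlapping_failures` and the two example lists are hypothetical placeholders, not the real data, and it assumes the failing URLs from each run were saved as plain lists of strings.

```python
def overlapping_failures(failed_full, failed_subset):
    """Return the URLs that failed in both runs, sorted for stable output."""
    return sorted(set(failed_full) & set(failed_subset))


# Hypothetical failure lists from the two runs (placeholders, not real data).
failed_full_run = ["http://foo.bar.com/url17", "http://foo.bar.com/url942"]
failed_subset_run = ["http://foo.bar.com/url3", "http://foo.bar.com/url58"]

common = overlapping_failures(failed_full_run, failed_subset_run)
if common:
    print("URLs failing in both runs: %s" % common)
else:
    print("No overlap: the failures look transient rather than URL-specific")
```

An empty intersection is only suggestive, since it assumes every URL in the subset was also requested in the full run.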
