python - Making thousands of GET requests to sourceforge with grequests, getting "Max retries exceeded with url"

Question

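I am very new to all of this; I need to obtain data on several thousand sourceforge projects for a paper I am writing. The data is all freely available in json format at http://sourceforge.net/api/project/name/[project name]/json. I have a list of several thousand of these URLs and I am using the following code.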

import grequests
rs = (grequests.get(u) for u in ulist)
answers = grequests.map(rs)

With this code I can get data for any 200 or so projects I like, i.e. rs = (grequests.get(u) for u in ulist[0:199]) works, but as soon as I go beyond that, every attempt is met with

ConnectionError: HTTPConnectionPool(host='sourceforge.net', port=80): Max retries exceeded with url: /api/project/name/p2p-fs/json (Caused by <class 'socket.gaierror'>: [Errno 8] nodename nor servname provided, or not known)
<Greenlet at 0x109b790f0: <bound method AsyncRequest.send of <grequests.AsyncRequest object at 0x10999ef50>>(stream=False)> failed with ConnectionError

After that I cannot make any more requests until I quit python, but as soon as I restart python I can make another 200 requests.

I have tried using grequests.map(rs,size=200), but this does not seem to help.

score 28 · Accepted Answer

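In my case, it was not rate limiting by the destination server but something much simpler: I was not explicitly closing the responses, so they kept their sockets open and the python process ran out of file handles.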
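One way to check whether this explanation is plausible on a given machine is to look at the per-process limit on open file descriptors. The sketch below uses the standard-library `resource` module, which is only available on Unix-like systems; the helper name `fd_limits` is purely illustrative. A soft limit in the low hundreds would be consistent with failures appearing after roughly 200 requests, although the thread does not confirm the asker's actual limit.

```python
import resource


def fd_limits():
    """Return the (soft, hard) limits on open file descriptors for this process."""
    return resource.getrlimit(resource.RLIMIT_NOFILE)


soft, hard = fd_limits()
print("soft limit:", soft, "hard limit:", hard)
```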

My solution (I am not sure which of the two fixed the issue; in theory either of them should) was to:

Set stream=False in grequests.get:

 rs = (grequests.get(u, stream=False) for u in urls)

Call response.close() explicitly after reading response.content:

 responses = grequests.map(rs)
 for response in responses:
     make_use_of(response.content)
     response.close()

Note: simply destroying the response object (assigning None to it, calling gc.collect()) was not enough; this did not close the file handles.
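The read-then-close pattern can be made a little more robust by closing inside a `finally` block and skipping `None` entries, which `grequests.map` can return for requests that failed. The following is a minimal sketch under those assumptions: `consume_and_close` and `handle` are hypothetical names, and the function works with any object that exposes `.content` and `.close()`, so it does not import grequests itself.

```python
def consume_and_close(responses, handle):
    """Pass each response body to `handle`, then close the response.

    `responses` is any iterable of objects with `.content` and `.close()`.
    None entries are skipped. Returns the number of bodies handled.
    """
    handled = 0
    for response in responses:
        if response is None:
            continue  # nothing to read or close for a failed request
        try:
            handle(response.content)
            handled += 1
        finally:
            # Runs even if handle() raises, so the socket is still released.
            response.close()
    return handled
```

With grequests this might be called as `consume_and_close(grequests.map(rs), make_use_of)`, where `make_use_of` is the placeholder from the answer.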

score 2 · Accepted Answer

This one can easily be changed to use whatever number of connections you want.

import time
import grequests

MAX_CONNECTIONS = 100  # Number of connections you want to limit it to
# urlsList: your list of URLs.

results = []
# Start at 0 and step through the whole list so that no URL is skipped.
for x in range(0, len(urlsList), MAX_CONNECTIONS):
    rs = (grequests.get(u, stream=False) for u in urlsList[x:x+MAX_CONNECTIONS])
    time.sleep(0.2)  # You can change this to whatever you see works better.
    results.extend(grequests.map(rs))  # The key here is to extend, not append, not insert.
    print("Waiting")  # Optional, so you see something is done.
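The batching idea can also be separated from the HTTP library, which makes it easier to test. The sketch below is one possible arrangement rather than the answerer's own code: `chunked`, `fetch_in_batches` and the `fetch_batch` callable are hypothetical names, and the default batch size and pause simply mirror the values used above.

```python
import time


def chunked(items, size):
    """Yield successive slices of `items`, each with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


def fetch_in_batches(urls, fetch_batch, batch_size=100, pause=0.2):
    """Call `fetch_batch` on each slice of `urls` and collect the results in order."""
    results = []
    for batch in chunked(urls, batch_size):
        results.extend(fetch_batch(batch))
        time.sleep(pause)  # brief pause between batches
    return results
```

With grequests, `fetch_batch` could be something like `lambda batch: grequests.map(grequests.get(u, stream=False) for u in batch)`.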
