python - HTTP request with timeout, max size and connection pooling

Question

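I'm looking for a way to do HTTP requests in Python (2.7) with 3 requirements:

timeout (for reliability)
content maximum size (for security)
connection pooling (for performance)

I've checked nearly every Python HTTP library, but none of them meets my requirements. For instance:

urllib2: Good, but no pooling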

import urllib2
import json

r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100+1)
if len(content) > 100: 
    print 'too large'
    r.close()
else:
    print json.loads(content)

r = urllib2.urlopen('https://github.com/timeline.json', timeout=5)
content = r.read(100000+1)
if len(content) > 100000: 
    print 'too large'
    r.close()
else:
    print json.loads(content)

requests: No max size

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
r.headers.get('content-length') # does not exist for this request, and not safe to trust
content = r.raw.read(100000+1)
print content # ARF this is gzipped, so not the real size
print json.loads(content) # content is gzipped so pretty useless
print r.json() # Does not work anymore since raw.read was used

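urllib3: Never got the "read" method working, even on a 50 MB file...

httplib: httplib.HTTPConnection is not a pool (only one connection)

I can hardly believe that urllib2 is the best HTTP library I can use! So if anyone knows which library can do this, or how to use one of the libraries above...

EDIT:

Thanks to Martijn Pieters, I found the best solution (StringIO does not slow down even for huge files, whereas repeated str concatenation does).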

import requests
from StringIO import StringIO

maxsize = 100000  # largest body, in bytes, that we accept
r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)
size = 0
ctt = StringIO()

for chunk in r.iter_content(2048):
    size += len(chunk)
    ctt.write(chunk)
    if size > maxsize:
        r.close()
        raise ValueError('Response too large')

content = ctt.getvalue()
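
The buffering loop above can be pulled out into a small helper. What follows is only a sketch: the name read_limited and the stand-in chunk list are made up for illustration, and io.BytesIO is used in place of StringIO so the same code runs on Python 2.7 and Python 3. It checks the size before writing, so the oversized chunk is never buffered.

```python
import io


def read_limited(chunks, max_bytes):
    """Join byte chunks into one byte string, failing once max_bytes is exceeded.

    `chunks` is any iterable of byte strings, e.g. r.iter_content(2048).
    """
    buf = io.BytesIO()
    size = 0
    for chunk in chunks:
        size += len(chunk)
        if size > max_bytes:
            # Stop right away; no further chunks are pulled from the iterable.
            raise ValueError('Response too large')
        buf.write(chunk)
    return buf.getvalue()


# Stand-in for r.iter_content(2048): three 2048-byte chunks.
fake_chunks = [b'x' * 2048] * 3
body = read_limited(fake_chunks, max_bytes=10000)
print(len(body))  # 6144
```

With a real response you would pass r.iter_content(2048) as chunks and still call r.close() yourself when ValueError is raised.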

score 19 · Accepted Answer

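You can do it with requests just fine, but you need to know that the raw object is part of the urllib3 guts, and make use of the extra arguments the HTTPResponse.read() call supports, which let you specify that you want to read decoded data: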

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

content = r.raw.read(100000+1, decode_content=True)
if len(content) > 100000:
    raise ValueError('Too large a response')
print content
print json.loads(content)

Alternatively, you can set the decode_content flag on the raw object before reading:

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

r.raw.decode_content = True
content = r.raw.read(100000+1)
if len(content) > 100000:
    raise ValueError('Too large a response')
print content
print json.loads(content)

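If you don't like reaching into the urllib3 guts like that, use response.iter_content() to iterate over the decoded content in chunks; this uses the underlying HTTPResponse too (via its .stream() generator version):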

import json
import requests

r = requests.get('https://github.com/timeline.json', timeout=5, stream=True)

maxsize = 100000
content = ''
for chunk in r.iter_content(2048):
    content += chunk
    if len(content) > maxsize:
        r.close()
        raise ValueError('Response too large')

print content
print json.loads(content)

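There is a subtle difference here in how compressed data sizes are handled: r.raw.read(100000+1) will only ever read 100k bytes of compressed data, and it is the uncompressed data that is tested against your max size. The iter_content() method will read more uncompressed data in the rare case that the compressed stream is larger than the uncompressed data.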
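
To see why it matters which side of the decompressor a limit applies to, here is a stdlib-only illustration. It does not call requests or urllib3 (whose exact accounting of the read amount may differ between versions); it simply gzips a repetitive payload with zlib and decodes a bounded slice of the compressed bytes. The payload and the limit are arbitrary values chosen for the demonstration.

```python
import zlib

# Build a gzip stream of highly repetitive data
# (wbits = 16 + MAX_WBITS selects gzip framing).
payload = b'a' * 1000000
compressor = zlib.compressobj(9, zlib.DEFLATED, 16 + zlib.MAX_WBITS)
compressed = compressor.compress(payload) + compressor.flush()

# Pretend the size limit was applied to the compressed bytes read off the wire.
limit = len(compressed) // 2
decoded = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(compressed[:limit])

# Capping the decoder's output instead: max_length bounds the decoded size.
capped = zlib.decompressobj(16 + zlib.MAX_WBITS).decompress(compressed, limit)

print('compressed=%d limit=%d decoded=%d capped=%d'
      % (len(compressed), limit, len(decoded), len(capped)))
```

The decoded slice comes out far larger than the limit on compressed input, while the capped call never returns more than limit bytes. This is why the checks in the answer compare the length of the decoded content against the maximum size.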

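Neither method allows r.json() to work; the response._content attribute isn't set by these. You can set it manually, of course, but since the .raw.read() and .iter_content() calls already give you access to the content in question, there is really no need.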
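
For the third requirement in the question, connection pooling, a requests.Session reuses connections through urllib3's pool, so the same size-limited loop can be run against a shared session. The helper below is a minimal sketch under that assumption: fetch_limited and its default values are hypothetical names chosen for illustration, and session is expected to be a requests.Session (or anything with a compatible get method).

```python
def fetch_limited(session, url, max_bytes=100000, timeout=5, chunk_size=2048):
    """Fetch url through a shared session, refusing bodies over max_bytes."""
    r = session.get(url, timeout=timeout, stream=True)
    try:
        parts = []
        size = 0
        for chunk in r.iter_content(chunk_size):
            size += len(chunk)
            if size > max_bytes:
                raise ValueError('Response too large')
            parts.append(chunk)
        # Join once at the end instead of growing a string chunk by chunk.
        return b''.join(parts)
    finally:
        # Always release the response, whether we finished or bailed out early.
        r.close()


# Usage (needs the third-party `requests` package and network access):
#   import json
#   import requests
#   session = requests.Session()  # create once, reuse for every request
#   body = fetch_limited(session, 'https://github.com/timeline.json')
#   data = json.loads(body.decode('utf-8'))
```

Keep in mind that the timeout in requests bounds the connection attempt and each wait for data from the server, not the total download time, so a slow trickle of small chunks can still take longer than timeout seconds overall.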
