python - Use Python Requests to "bridge" a file without loading it into memory?

Question

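I'd like to use the Python Requests library to GET a file from a URL and use it as a multipart-encoded file in a POST request. The catch is that the file could be very large (50MB-2GB), and I don't want to load it into memory. (Context here.)

Following the examples in the docs (multipart, stream down and stream up) I cooked up something like this: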

    with requests.get(big_file_url, stream=True) as f:
        requests.post(upload_url, files={'file': ('filename', f.content)})

But I'm not sure I'm doing it right. It is in fact throwing this error (excerpted from the traceback):

    with requests.get(big_file_url, stream=True) as f:
    AttributeError: __exit__

Any suggestions?

score 4 · Accepted Answer

As other answers have already pointed out: requests doesn't support POSTing multipart-encoded files without loading them into memory.

To upload a large file with multipart/form-data without loading it into memory, you could use poster:

#!/usr/bin/env python
import sys
from urllib2 import Request, urlopen

from poster.encode import multipart_encode # $ pip install poster
from poster.streaminghttp import register_openers

register_openers() # install openers globally

def report_progress(param, current, total):
    sys.stderr.write("\r%03d%% of %d" % (int(1e2*current/total + .5), total))

url = 'http://example.com/path/'
params = {'file': open(sys.argv[1], "rb"), 'name': 'upload test'}
response = urlopen(Request(url, *multipart_encode(params, cb=report_progress)))
print response.read()

It can be adapted to accept a GET response object instead of a local file:

import posixpath
import sys
from urllib import unquote
from urllib2 import Request, urlopen
from urlparse import urlsplit

from poster.encode import MultipartParam, multipart_encode # pip install poster
from poster.streaminghttp import register_openers

register_openers() # install openers globally

class MultipartParamNoReset(MultipartParam):
    def reset(self):
        pass # do nothing (to allow self.fileobj without seek() method)

get_url = 'http://example.com/bigfile'
post_url = 'http://example.com/path/'

get_response = urlopen(get_url)
param = MultipartParamNoReset(
    name='file',
    filename=posixpath.basename(unquote(urlsplit(get_url).path)), #XXX \ bslash
    filetype=get_response.headers['Content-Type'],
    filesize=int(get_response.headers['Content-Length']),
    fileobj=get_response)

params = [('name', 'upload test'), param]
# report_progress is the callback defined in the previous example
datagen, headers = multipart_encode(params, cb=report_progress)
post_response = urlopen(Request(post_url, datagen, headers))
print post_response.read()

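This solution requires a valid Content-Length header (known file size) in the GET response. If the file size is unknown, chunked transfer encoding could be used to upload the multipart/form-data content. A similar solution could be implemented with urllib3.filepost, which ships with the requests library, e.g. based on @AdrienF's answer, without using poster.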
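For reference, chunked transfer encoding frames each piece of the body with its size in hexadecimal and ends with a zero-length chunk, which is why no Content-Length is needed up front. The sketch below only illustrates that framing. The helper name `chunked` is hypothetical, HTTP libraries normally do this for you, and the code is written for Python 3 rather than the Python 2 used in the answer.

```python
def chunked(pieces):
    """Frame an iterable of byte strings using HTTP/1.1 chunked transfer coding."""
    for piece in pieces:
        if not piece:
            continue                      # an empty chunk would end the body early
        yield b"%x\r\n" % len(piece)      # chunk size in hexadecimal
        yield piece + b"\r\n"
    yield b"0\r\n\r\n"                    # last-chunk marker, no trailers


framed = b"".join(chunked([b"hello", b"", b"world!"]))
```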

score 2 · Accepted Answer

You cannot turn anything you please into a context manager in Python. It requires very specific attributes to be one. With your current code you can do the following:

response = requests.get(big_file_url, stream=True)

post_response = requests.post(upload_url, files={'file': ('filename', response.iter_content())})

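Using iter_content ensures that your file is never held in memory all at once, because an iterator is used. Using the content attribute instead loads the whole file into memory.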
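The `AttributeError: __exit__` from the question arises because a `with` statement needs an object that defines `__enter__` and `__exit__`, which the response object in the asker's setup evidently did not. A common workaround is `contextlib.closing`, which wraps any object that has a `close()` method. In this minimal Python 3 sketch, `FakeResponse` is a hypothetical stand-in (not part of requests) so the snippet runs without network access.

```python
from contextlib import closing


class FakeResponse:
    """Hypothetical stand-in: has close() but no __enter__/__exit__."""

    def __init__(self):
        self.closed = False

    def iter_content(self, chunk_size=2):
        # Yield the payload piece by piece instead of returning it whole.
        payload = b"abcdef"
        for i in range(0, len(payload), chunk_size):
            yield payload[i:i + chunk_size]

    def close(self):
        self.closed = True


resp = FakeResponse()
with closing(resp) as r:      # closing() supplies __enter__/__exit__
    chunks = list(r.iter_content())
# resp.close() has been called on leaving the block
```

With a real streamed response the same pattern would presumably read `with closing(requests.get(big_file_url, stream=True)) as r:`.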

Edit: The only way to reasonably do this is to use chunk-encoded uploads, e.g.,

post_response = requests.post(upload_url, data=response.iter_content())

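If you absolutely need multipart/form-data encoding, you will have to create an abstraction layer that takes the generator in its constructor, along with the Content-Length header from response (to provide an answer for len(file)), and that has a read attribute which reads from the generator. The issue, again, is that I'm pretty sure the entire thing will be read into memory before it is uploaded.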
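The abstraction layer described above could look roughly like this: a small file-like wrapper that takes an iterator of byte chunks plus a known total length, exposes `read()`, and reports the length through `__len__`. The class name `IterStream` is made up for illustration, and whether a given HTTP library streams such an object or buffers it is not verified here. Python 3 sketch.

```python
class IterStream:
    """Hypothetical file-like wrapper around an iterator of byte chunks."""

    def __init__(self, chunks, length):
        self._chunks = iter(chunks)
        self._length = length      # e.g. int(response.headers['Content-Length'])
        self._buffer = b""

    def __len__(self):
        return self._length

    def read(self, size=-1):
        if size is None or size < 0:
            # Read everything that is left (this does buffer it all).
            data = self._buffer + b"".join(self._chunks)
            self._buffer = b""
            return data
        # Pull chunks only until the request can be satisfied.
        while len(self._buffer) < size:
            try:
                self._buffer += next(self._chunks)
            except StopIteration:
                break
        data, self._buffer = self._buffer[:size], self._buffer[size:]
        return data


stream = IterStream(iter([b"abc", b"de", b"fgh"]), length=8)
first = stream.read(4)
second = stream.read(4)
third = stream.read(4)    # empty bytes signal end of stream
```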

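Edit #2

You might be able to write a generator of your own that produces the multipart/form-data encoded data. You could pass it in the same way as the chunk-encoded request, but you would have to make sure you set your own Content-Type and Content-Length headers. I don't have time to sketch out an example, but it shouldn't be too difficult.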
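A minimal sketch of such a generator, assuming the file size is known in advance (for example from the GET response's Content-Length header). The function name `multipart_stream`, its parameters, and the fixed boundary in the usage lines are illustrative assumptions. It uses only the standard library and is written for Python 3.

```python
import uuid


def multipart_stream(field, filename, chunks, size,
                     content_type="application/octet-stream", boundary=None):
    """Return (body_generator, headers) for a single-file multipart/form-data body.

    chunks: iterable of bytes making up the file; size: their total length.
    """
    boundary = boundary or uuid.uuid4().hex
    head = (
        "--{b}\r\n"
        'Content-Disposition: form-data; name="{n}"; filename="{f}"\r\n'
        "Content-Type: {t}\r\n\r\n"
    ).format(b=boundary, n=field, f=filename, t=content_type).encode("utf-8")
    tail = "\r\n--{b}--\r\n".format(b=boundary).encode("utf-8")

    def body():
        yield head
        for chunk in chunks:   # file data passes through one chunk at a time
            yield chunk
        yield tail

    headers = {
        "Content-Type": "multipart/form-data; boundary=" + boundary,
        "Content-Length": str(len(head) + size + len(tail)),
    }
    return body(), headers


gen, headers = multipart_stream("file", "big.bin", iter([b"abc", b"defg"]),
                                size=7, boundary="XYZ")
payload = b"".join(gen)   # joined here only to check the framing
```

With requests, the generator would presumably be passed as `data=` together with these headers. Whether the library then honors an explicit Content-Length for an iterator body, rather than switching to chunked encoding, should be checked against the version in use.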

score 1 · Accepted Answer

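There is in fact an issue about this on Kenneth Reitz's GitHub repo. I had the same problem (although I'm only uploading a local file), so I added a wrapper class. It holds a list of streams corresponding to the different parts of the request, has a read() attribute that iterates through the list and reads each part, and also computes the values needed for the headers (boundary and content length):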

# coding=utf-8

from __future__ import unicode_literals
from mimetools import choose_boundary
from requests.packages.urllib3.filepost import iter_fields, get_content_type
from io import BytesIO
import codecs

writer = codecs.lookup('utf-8')[3]

class MultipartUploadWrapper(object):

    def __init__(self, files):
        """
        Initializer

        :param files:
            A dictionary of files to upload, of the form {'file': ('filename', <file object>)}
        :type files:
            Dict
        """
        super(MultipartUploadWrapper, self).__init__()
        self._cursor = 0
        self._body_parts = None
        self.content_type_header = None
        self.content_length_header = None
        self.create_request_parts(files)

    def create_request_parts(self, files):
        request_list = []
        boundary = choose_boundary()
        content_length = 0

        boundary_string = b'--%s\r\n' % (boundary)
        for fieldname, value in iter_fields(files):
            content_length += len(boundary_string)

            if isinstance(value, tuple):
                filename, data = value
                content_disposition_string = (('Content-Disposition: form-data; name="%s"; ''filename="%s"\r\n' % (fieldname, filename))
                                            + ('Content-Type: %s\r\n\r\n' % (get_content_type(filename))))

            else:
                data = value
                content_disposition_string =  (('Content-Disposition: form-data; name="%s"\r\n' % (fieldname))
                                            + 'Content-Type: text/plain\r\n\r\n')
            request_list.append(BytesIO(str(boundary_string + content_disposition_string)))
            content_length += len(content_disposition_string)
            if hasattr(data, 'read'):
                data_stream = data
            else:
                data_stream = BytesIO(str(data))

            data_stream.seek(0,2)
            data_size = data_stream.tell()
            data_stream.seek(0)

            request_list.append(data_stream)
            content_length += data_size

            end_string = b'\r\n'
            request_list.append(BytesIO(end_string))
            content_length += len(end_string)

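        # The closing boundary is two bytes longer than boundary_string, so count it separately.
        closing_boundary = b'--%s--\r\n' % (boundary)
        request_list.append(BytesIO(closing_boundary))
        content_length += len(closing_boundary)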

        # There's a bug in httplib.py that generates a UnicodeDecodeError on binary uploads if
        # there are *any* unicode strings passed into headers as part of the requests call.
        # For this reason all strings are explicitly converted to non-unicode at this point.
        self.content_type_header = {b'Content-Type': b'multipart/form-data; boundary=%s' % boundary}
        self.content_length_header = {b'Content-Length': str(content_length)}
        self._body_parts = request_list

    def read(self, chunk_size=0):
        remaining_to_read = chunk_size
        output_array = []
        while remaining_to_read > 0:
            body_part = self._body_parts[self._cursor]
            current_piece = body_part.read(remaining_to_read)
            length_read = len(current_piece)
            output_array.append(current_piece)
            if length_read < remaining_to_read:
                # we finished this piece but haven't read enough, moving on to the next one
                remaining_to_read -= length_read
                if self._cursor == len(self._body_parts) - 1:
                    break
                else:
                    self._cursor += 1
            else:
                break
        return b''.join(output_array)

So instead of passing a 'files' keyword argument, you pass this object as the 'data' attribute to your Request.request object.

Edit

I've cleaned up the code.

score 1 · Accepted Answer

In theory you can just use the raw object:

In [1]: import requests

In [2]: raw = requests.get("http://download.thinkbroadband.com/1GB.zip", stream=True).raw

In [3]: raw.read(10)
Out[3]: '\xff\xda\x18\x9f@\x8d\x04\xa11_'

In [4]: raw.read(10)
Out[4]: 'l\x15b\x8blVO\xe7\x84\xd8'

In [5]: raw.read() # take forever...

In [6]: raw = requests.get("http://download.thinkbroadband.com/5MB.zip", stream=True).raw

In [7]: requests.post("http://www.amazon.com", {'file': ('thing.zip', raw, 'application/zip')}, stream=True)
Out[7]: <Response [200]>
