python - How to gzip while uploading to S3 using boto

Question

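I have a large local file. I want to upload a gzipped version of that file to S3 using the boto library. The file is too large to gzip it efficiently on disk prior to uploading, so it should be gzipped in a streamed way during the upload.

The boto library has a function set_contents_from_file() which expects a file-like object that it will read from.

The gzip library has the class GzipFile, which can be given an object via the parameter named fileobj; it writes to this object when compressing.

I'd like to combine these two, but one API wants to do the reading itself and the other wants to do the writing itself; neither supports a passive operation (being written to or being read from).

Does anybody have an idea how to combine these in a working fashion?

EDIT: I accepted one answer (see below) because it hinted at where to go, but if you have the same problem, you might find my own answer (also below) more helpful, because in it I implemented a solution using multipart uploads.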

score 28 · Accepted Answer

I implemented the solution hinted at in the comments of the accepted answer by garnaat:

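import cStringIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    # In-memory buffer that receives the compressed bytes
    stream = cStringIO.StringIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    # The mutable default argument keeps the part counter between calls
    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        # Empty the buffer so the next part starts from scratch
        stream.seek(0)
        stream.truncate()

    # Binary mode, so the bytes are read unchanged on every platform
    with open(fileName, 'rb') as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()  # flushes the rest and writes the gzip trailer
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            # Every part except the last must be at least 5242880 bytes (5 MiB)
            if stream.tell() > 10<<20:
                uploadPart()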

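It seems to work without problems. After all, streaming is in most cases just chunking of the data. In this case the chunks are about 10 MB each, but who cares? As long as we aren't talking about chunks of several GB, I'm fine with this.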
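
The chunking idea can be checked without touching S3 at all. The following is a minimal Python 3 sketch, not part of the original answer: iter_gzip_parts is a hypothetical helper that yields the compressed parts one by one, under the assumption that the parts are later joined in order (which is what a completed multipart upload does).

```python
import gzip
import io

def iter_gzip_parts(source, part_size=10 << 20, read_size=8192):
    """Yield gzip-compressed byte strings read from the binary file object `source`.

    Every yielded part except the last is larger than `part_size`.
    Hypothetical helper for illustration only.
    """
    buffer = io.BytesIO()
    compressor = gzip.GzipFile(fileobj=buffer, mode="wb")
    while True:
        chunk = source.read(read_size)
        if not chunk:
            break
        compressor.write(chunk)
        if buffer.tell() > part_size:
            yield buffer.getvalue()
            # Empty the buffer; the compressor keeps its own state
            buffer.seek(0)
            buffer.truncate()
    # Flush the remaining data and write the gzip trailer
    compressor.close()
    yield buffer.getvalue()
```

Concatenating the yielded parts in order gives one valid gzip stream, which is why uploading them as consecutive parts produces a normal .gz object.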

Update for Python 3:

from io import BytesIO
import gzip

def sendFileGz(bucket, key, fileName, suffix='.gz'):
    key += suffix
    mpu = bucket.initiate_multipart_upload(key)
    # In-memory buffer that receives the compressed bytes
    stream = BytesIO()
    compressor = gzip.GzipFile(fileobj=stream, mode='w')

    # The mutable default argument keeps the part counter between calls
    def uploadPart(partCount=[0]):
        partCount[0] += 1
        stream.seek(0)
        mpu.upload_part_from_file(stream, partCount[0])
        # Empty the buffer so the next part starts from scratch
        stream.seek(0)
        stream.truncate()

    with open(fileName, "rb") as inputFile:
        while True:  # until EOF
            chunk = inputFile.read(8192)
            if not chunk:  # EOF?
                compressor.close()  # flushes the rest and writes the gzip trailer
                uploadPart()
                mpu.complete_upload()
                break
            compressor.write(chunk)
            # Every part except the last must be at least 5242880 bytes (5 MiB)
            if stream.tell() > 10<<20:
                uploadPart()
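
The answer above uses the legacy boto API. For readers on boto3, the same approach might look like the sketch below. It is an untested-against-AWS adaptation, not the original author's code: the function name upload_gzip_multipart and its parameters are made up for illustration, and s3_client is assumed to be a boto3.client('s3') instance. It relies on the boto3 client calls create_multipart_upload, upload_part, complete_multipart_upload and abort_multipart_upload.

```python
import gzip
import io

def upload_gzip_multipart(s3_client, bucket_name, key, file_name, part_size=10 << 20):
    """Gzip `file_name` chunk by chunk and send it as a multipart upload.

    Hypothetical helper; `s3_client` is assumed to be boto3.client('s3').
    """
    upload_id = s3_client.create_multipart_upload(Bucket=bucket_name, Key=key)["UploadId"]
    buffer = io.BytesIO()
    compressor = gzip.GzipFile(fileobj=buffer, mode="wb")
    parts = []  # ETag and number of every uploaded part, needed to complete the upload

    def upload_part():
        part_number = len(parts) + 1
        response = s3_client.upload_part(
            Bucket=bucket_name, Key=key, UploadId=upload_id,
            PartNumber=part_number, Body=buffer.getvalue())
        parts.append({"ETag": response["ETag"], "PartNumber": part_number})
        # Empty the buffer so the next part starts from scratch
        buffer.seek(0)
        buffer.truncate()

    try:
        with open(file_name, "rb") as input_file:
            for chunk in iter(lambda: input_file.read(8192), b""):
                compressor.write(chunk)
                # Every part except the last must be at least 5 MiB
                if buffer.tell() > part_size:
                    upload_part()
        compressor.close()  # flushes the rest and writes the gzip trailer
        upload_part()
        s3_client.complete_multipart_upload(
            Bucket=bucket_name, Key=key, UploadId=upload_id,
            MultipartUpload={"Parts": parts})
    except Exception:
        # Abort so the already uploaded parts do not linger in the bucket
        s3_client.abort_multipart_upload(Bucket=bucket_name, Key=key, UploadId=upload_id)
        raise
```

Unlike the legacy API, boto3 expects the caller to collect the ETag of each part and pass the list back when completing the upload, hence the parts list.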

score 9 · Accepted Answer

You can also compress bytes with gzip and upload them quite easily, as follows:

import gzip
import boto3

cred = boto3.Session().get_credentials()

s3client = boto3.client('s3',
                        aws_access_key_id=cred.access_key,
                        aws_secret_access_key=cred.secret_key,
                        aws_session_token=cred.token)

bucketname = 'my-bucket-name'
key = 'filename.gz'

s_in = b"Lots of content here"
# Compresses the whole payload in memory
gzip_object = gzip.compress(s_in)

s3client.put_object(Bucket=bucketname, Body=gzip_object, Key=key)

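You can replace s_in with any bytes: the contents of an io.BytesIO, a pickle dump, a file, and so on. Since gzip.compress() takes a bytes object, turn the source into bytes first (getvalue() on a BytesIO, pickle.dumps() on an object, read() on a file opened in binary mode).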
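
As a rough illustration of those substitutions, here is a small sketch. The helper name to_gzip is invented for this example and is not part of gzip or boto3; it only shows how each kind of source can be turned into bytes before calling gzip.compress().

```python
import gzip
import io
import pickle

def to_gzip(source):
    """Return gzip-compressed bytes for several kinds of input (hypothetical helper)."""
    if isinstance(source, (bytes, bytearray, memoryview)):
        raw = bytes(source)
    elif isinstance(source, io.BytesIO):
        raw = source.getvalue()      # whole buffer, regardless of the current position
    elif hasattr(source, "read"):
        raw = source.read()          # file must be opened in binary mode
    else:
        raw = pickle.dumps(source)   # any other picklable Python object
    return gzip.compress(raw)
```

Everything is held in memory here, so this suits payloads that fit in RAM rather than the very large file from the question.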

If you want to upload compressed JSON, here is a nice example: Upload compressed Json to S3
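
The linked example is not reproduced here, but the idea can be sketched as follows. The function names gzip_json and put_gzip_json are hypothetical, s3_client is assumed to be a boto3.client('s3') instance, and setting ContentType and ContentEncoding on put_object is one possible choice rather than a requirement.

```python
import gzip
import json

def gzip_json(obj):
    """Serialize `obj` to JSON and gzip the UTF-8 encoded text (hypothetical helper)."""
    return gzip.compress(json.dumps(obj).encode("utf-8"))

def put_gzip_json(s3_client, bucket_name, key, obj):
    """Upload `obj` as gzipped JSON; `s3_client` is assumed to be boto3.client('s3')."""
    s3_client.put_object(
        Bucket=bucket_name,
        Key=key,
        Body=gzip_json(obj),
        ContentType="application/json",
        # Assumption: HTTP clients that honor this header decompress transparently
        ContentEncoding="gzip",
    )
```

Reading the object back is the reverse: gzip.decompress() on the body, then json.loads() on the decoded text.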

score 6 · Accepted Answer

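There really isn't a way to do this, because S3 doesn't support true streaming input (i.e. chunked transfer encoding). You must know the Content-Length prior to upload, and the only way to know that is to have performed the gzip operation first.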
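
To make the Content-Length point concrete, here is a minimal sketch of "gzip first, then upload": the file is compressed into a temporary file so that the compressed size is known before any upload starts. The helper name gzip_to_tempfile is hypothetical, and the approach needs disk space for the compressed copy, which is exactly what the question hoped to avoid.

```python
import gzip
import shutil
import tempfile

def gzip_to_tempfile(file_name):
    """Compress `file_name` into a temporary file; return (file object, size in bytes).

    Hypothetical helper for illustration only.
    """
    tmp = tempfile.TemporaryFile()
    with open(file_name, "rb") as src, gzip.GzipFile(fileobj=tmp, mode="wb") as gz:
        shutil.copyfileobj(src, gz)  # streams the input in chunks
    # Closing the GzipFile wrote the trailer but left tmp open
    size = tmp.tell()
    tmp.seek(0)                      # rewind so an uploader can read from the start
    return tmp, size
```

The returned size is the value that would go into Content-Length, and the rewound file object can be handed to any upload call that reads from a file-like object.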
