
TL;DR: Trying to put .json files into an S3 bucket using Boto3, but the process is very slow. Looking for ways to speed it up.

This is my first question on SO, so I apologize if I leave out any important details. Essentially, I am trying to pull data from Elasticsearch and store it in an S3 bucket using Boto3. I referred to this post to pull multiple pages of ES data using the scroll function of the ES Python client. As I scroll, I process the data and put it into the bucket as [timestamp].json files, using this:

    import boto3

    s3 = boto3.resource('s3')
    data = '{"some":"json","test":"data"}'
    key = "path/to/my/file/[timestamp].json"
    # one PUT request per small JSON document
    s3.Bucket('my_bucket').put_object(Key=key, Body=data)
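
For context, the surrounding loop looks roughly like this (a simplified sketch: the index name, query, and timestamp field are placeholders, and I am using the client's `helpers.scan` here as shorthand for the explicit scroll calls):

    import json
    import boto3
    from elasticsearch import Elasticsearch, helpers

    es = Elasticsearch()                       # placeholder: my actual ES endpoint
    bucket = boto3.resource('s3').Bucket('my_bucket')

    # helpers.scan wraps the scroll API and yields hits one at a time
    for hit in helpers.scan(es, index='my_index', query={'query': {'match_all': {}}}):
        doc = hit['_source']                   # process the document here
        key = "path/to/my/file/{}.json".format(doc['timestamp'])  # placeholder field
        bucket.put_object(Key=key, Body=json.dumps(doc))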

While running this on my machine, I noticed that the process is very slow. Using a line profiler, I discovered that this one line consumes over 96% of my program's total run time:

    s3.Bucket('my_bucket').put_object(Key=key, Body=data)

What modification(s) can I make to speed up this process? Keep in mind that I am creating the .json files in my program (each one is ~240 bytes) and streaming them directly to S3 rather than saving them locally and uploading the files. Thanks in advance.


1 Answer


Since you are likely uploading many small files, you should consider a few things:

  • Some form of threading/multiprocessing (see the first sketch after this list). You could also look at How to upload small files to Amazon S3 efficiently in Python.
  • Creating some form of archive file (ZIP) that bundles sets of your small data blocks, and uploading those as larger files (see the second sketch below). This of course depends on your access patterns. If you go this route, be sure to use the boto3 upload_file or upload_fileobj methods, as they handle multipart uploads and threading for you.
  • The S3 performance implications described in Request Rate and Performance Considerations.
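
For the threading option, here is a minimal sketch using concurrent.futures (the bucket name, keys, and the payloads dict are placeholders; note that low-level boto3 clients are safe to share across threads, while resource objects are not):

    import boto3
    from concurrent.futures import ThreadPoolExecutor

    s3 = boto3.client('s3')   # low-level client: safe to share across threads

    def upload(item):
        key, body = item
        s3.put_object(Bucket='my_bucket', Key=key, Body=body)

    # payloads: keys -> JSON strings accumulated while scrolling (placeholder data)
    payloads = {'path/to/my/file/1.json': '{"some":"json","test":"data"}'}

    with ThreadPoolExecutor(max_workers=16) as pool:
        list(pool.map(upload, payloads.items()))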
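
For the archive option, a sketch that batches documents into an in-memory ZIP and hands it to upload_fileobj, which handles multipart uploads and threading internally (the member names and archive key are placeholders):

    import io
    import zipfile
    import boto3

    s3 = boto3.client('s3')

    # docs: member name -> small JSON string (placeholder data)
    docs = {'1.json': '{"some":"json","test":"data"}'}

    buf = io.BytesIO()
    with zipfile.ZipFile(buf, mode='w', compression=zipfile.ZIP_DEFLATED) as zf:
        for name, body in docs.items():
            zf.writestr(name, body)
    buf.seek(0)

    s3.upload_fileobj(buf, 'my_bucket', 'path/to/my/archive.zip')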
answered Jun 21, 2018 at 15:49