
I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from one path in an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600 MB transferred after more than 20 minutes).

Here is my EMR configuration:

1 master node: m5.xlarge
3 core nodes: m5.xlarge
release label: emr-5.29.0

The command:

s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128
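As a rough sanity check (back-of-envelope arithmetic using the numbers above, not S3DistCp's exact grouping algorithm), the average input object size and the expected number of grouped output files can be estimated:

```python
import math

# Numbers from the question; S3DistCp's actual grouping may differ slightly.
num_files = 200_000
total_mb = 3.4 * 1024          # ~3.4 GB of input
target_mb = 128                # --targetSize=128

avg_kb_per_file = total_mb * 1024 / num_files      # ~17.8 KB per object
expected_outputs = math.ceil(total_mb / target_mb)  # ~28 grouped files

print(round(avg_kb_per_file, 1), expected_outputs)
```

At ~18 KB per object, the job is dominated by per-object request overhead rather than raw bandwidth, which is consistent with the slow transfer rate observed.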

Am I missing something? I have read that S3DistCp can transfer a lot of files in the blink of an eye, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.

Thank you.


1 Answer


Here are some recommendations:

  1. Use R-type instances. They provide more memory than M-type instances.
  2. Use coalesce to merge the files at the source, since you have many small files.
  3. Check the number of mapper tasks. The more tasks there are, the worse the performance.
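Point 2 could be sketched in PySpark as follows. This is a hypothetical alternative to s3-dist-cp, not the answerer's exact code; the paths mirror the question, and the partition count is an assumption derived from ~3.4 GB at ~128 MB per output file:

```python
import math

# Assumed sizing, taken from the question: ~3.4 GB input, ~128 MB targets.
total_mb, target_mb = 3.4 * 1024, 128
n = math.ceil(total_mb / target_mb)   # ~28 output files of ~128 MB each

def compact(spark, src, dest, n_parts):
    # coalesce() merges partitions without a full shuffle, so the write
    # emits n_parts large files instead of hundreds of thousands of tiny ones.
    (spark.read.json(src)
         .coalesce(n_parts)
         .write.mode("overwrite")
         .json(dest, compression="gzip"))

# Usage on EMR (requires pyspark; SparkSession name is illustrative):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("compact-small-files").getOrCreate()
# compact(spark, "s3://my-bucket/input/", "s3://my-bucket/output/", n)
```

Reading everything once and writing a small number of large gzip files sidesteps the per-object overhead that makes copying 200K tiny objects slow.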
answered 2020-10-11T17:03:56.547