
I am using S3DistCp on an EMR cluster to aggregate around 200K small files (3.4 GB in total) from one path in an S3 bucket to another path in the same bucket. It works, but it is extremely slow (around 600 MB transferred after more than 20 minutes).

Here is my EMR configuration:

1 master node: m5.xlarge
3 core nodes: m5.xlarge
release label: emr-5.29.0

The command:

s3-dist-cp --s3Endpoint=s3-eu-central-1.amazonaws.com --src=s3://my-bucket/input/ --dest=s3://my-bucket/output/ --groupBy=.*input/(entry).*(.json.gz) --targetSize=128
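As a rough sanity check (back-of-envelope arithmetic using the numbers above, not S3DistCp's exact grouping algorithm), the average input object size and the expected number of grouped output files can be estimated:

```python
import math

# Numbers from the question; S3DistCp's actual grouping may differ slightly.
num_files = 200_000
total_mb = 3.4 * 1024          # ~3.4 GB of input
target_mb = 128                # --targetSize=128

avg_kb_per_file = total_mb * 1024 / num_files      # ~17.8 KB per object
expected_outputs = math.ceil(total_mb / target_mb)  # ~28 grouped files

print(round(avg_kb_per_file, 1), expected_outputs)
```

At ~18 KB per object, the job is dominated by per-object request overhead rather than raw bandwidth, which is consistent with the slow transfer rate observed.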

Am I missing something? I have read that S3DistCp can transfer a lot of files in the blink of an eye, but I can't figure out how. Both the EMR cluster and the bucket are in the same region, by the way.

Thank you.


1 Answer


Here are some recommendations:

  1. Use R-type instances. They provide more memory than M-type instances.
  2. Use coalesce to merge the files at the source, since you have many small files.
  3. Check the number of mapper tasks. The more tasks there are, the worse the performance.
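Point 2 could be sketched in PySpark as follows. This is a hypothetical alternative to s3-dist-cp, not the answerer's exact code; the paths mirror the question, and the partition count is an assumption derived from ~3.4 GB at ~128 MB per output file:

```python
import math

# Assumed sizing, taken from the question: ~3.4 GB input, ~128 MB targets.
total_mb, target_mb = 3.4 * 1024, 128
n = math.ceil(total_mb / target_mb)   # ~28 output files of ~128 MB each

def compact(spark, src, dest, n_parts):
    # coalesce() merges partitions without a full shuffle, so the write
    # emits n_parts large files instead of hundreds of thousands of tiny ones.
    (spark.read.json(src)
         .coalesce(n_parts)
         .write.mode("overwrite")
         .json(dest, compression="gzip"))

# Usage on EMR (requires pyspark; SparkSession name is illustrative):
# from pyspark.sql import SparkSession
# spark = SparkSession.builder.appName("compact-small-files").getOrCreate()
# compact(spark, "s3://my-bucket/input/", "s3://my-bucket/output/", n)
```

Reading everything once and writing a small number of large gzip files sidesteps the per-object overhead that makes copying 200K tiny objects slow.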
answered 2020-10-11T17:03:56.547