I have an S3 bucket with log files that I want to concatenate, then use as an input to an EMR job. The log files are in paths like: bucket-name/[date]/product/out/[hour]/[minute-based-file]. I'd like to take all the minute logs in all the hour directories in all the date directories, and concatenate them into one file. I want to use that file as an input to an EMR job. The original log files need to be preserved, and the new combined log file will probably be written to a different S3 bucket.
I tried using hadoop fs -getmerge on the EMR master node via SSH, but got this error:
This file system object (file:///) does not support access to the request path 's3://target-bucket-name/merged.log'
The source S3 bucket has some other files in it, so I don't want to include all of its files. The wildcard match looks like this: s3n://bucket-name/*/product/out/*/log.*
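To make the intended selection concrete, here is a minimal Python sketch of which keys that wildcard should and should not pick up. It assumes that `*` matches within a single path segment only, as I understand Hadoop globbing to work. The keys listed are made up for illustration, and `glob_to_regex` and `select_keys` are hypothetical helper names, not part of any library.

```python
import re

def glob_to_regex(pattern):
    """Translate a glob into a regex where '*' never crosses a '/' boundary."""
    parts = pattern.split("*")
    return re.compile("^" + "[^/]*".join(re.escape(p) for p in parts) + "$")

def select_keys(keys, pattern):
    """Return only the keys that match the glob, preserving input order."""
    rx = glob_to_regex(pattern)
    return [k for k in keys if rx.match(k)]

# Hypothetical keys (relative to the bucket root), for illustration only.
keys = [
    "2013-01-01/product/out/00/log.0001",      # wanted
    "2013-01-01/product/out/23/log.5959",      # wanted
    "2013-01-01/product/out/00/other.txt",     # not a log file
    "2013-01-01/other/out/00/log.0001",        # different product path
    "2013-01-01/product/out/00/sub/log.0001",  # nested one level too deep
]

matched = select_keys(keys, "*/product/out/*/log.*")
for key in matched:
    print(key)
```

Only the first two keys should be selected; everything else in the bucket has to stay out of the merged file.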
The purpose is to get around the problem of having tens or hundreds of thousands of small (10 KB to 3 MB) input files to EMR, and instead give it one large file that it can split more efficiently.