I need to move a large number of files (on the order of tens of terabytes) from Amazon S3 into Google Cloud Storage. The files in S3 are all under 500 MB.
So far I have tried using gsutil cp with the parallel option (-m), with S3 as the source and GCS as the destination directly. Even after tweaking the multi-processing and multi-threading parameters, I haven't been able to achieve throughput above 30 MB/s.
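Roughly what I've been running (bucket names are placeholders, and the .boto values are just the ones I've been experimenting with):

    # ~/.boto, under [GSUtil] -- the parameters I've been varying:
    #   parallel_process_count = 12
    #   parallel_thread_count = 10
    # (AWS credentials for the s3:// source live under [Credentials] in the same file.)

    gsutil -m cp -r s3://my-source-bucket/data gs://my-dest-bucket/data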
What I am now contemplating:
Load the data in batches from S3 into HDFS using distcp, and then find a way of distcp-ing all of that data into Google Storage (not supported as far as I can tell; I've sketched the S3-to-HDFS leg below this list), or:
Set up a Hadoop cluster where each node runs a parallel gsutil cp job with S3 as the source and GCS as the destination.
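For the first option, the S3-to-HDFS leg would presumably look something like the following (bucket names, staging path, and keys are placeholders, and I'm assuming the s3a connector is available on the cluster); it's the HDFS-to-GCS leg that I can't find support for:

    hadoop distcp \
      -Dfs.s3a.access.key=PLACEHOLDER \
      -Dfs.s3a.secret.key=PLACEHOLDER \
      s3a://my-source-bucket/data \
      hdfs:///staging/data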
If the first option were supported, I would really appreciate details on how to do it. However, it seems like I'm going to have to figure out how to do the second one. I'm unsure how to pursue this avenue because I would need to keep track of gsutil's resumable transfer feature across many nodes, and I'm generally inexperienced with running this sort of Hadoop job.
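What I'm imagining each node running for the second option is something like the one-liner below, where file-shard-N.txt is that node's slice of the full s3:// object listing and the -L manifest log is what I'd rely on for resumability, since re-running cp with the same log is supposed to skip objects already recorded as copied (file and bucket names here are placeholders, and I'm assuming -I accepts the s3:// URLs from stdin):

    # each node gets its own shard of the object listing and its own manifest log
    gsutil -m cp -I -L transfer-manifest-N.csv gs://my-dest-bucket/data/ < file-shard-N.txt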
Any help on how to pursue one of these avenues (or something simpler I haven't thought of) would be greatly appreciated.