
I need to move a large number of files (on the order of tens of terabytes) from Amazon S3 into Google Cloud Storage. The files in S3 are all under 500 MB.

So far I have tried using gsutil cp with the parallel option (-m), using S3 as the source and GS as the destination directly. Even after tweaking the multi-processing and multi-threading parameters, I haven't been able to achieve throughput of more than about 30 MB/s.
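Roughly, what I have been running looks like this (bucket names are placeholders); the parallelism knobs are the parallel_process_count and parallel_thread_count settings in the [GSUtil] section of the .boto config file:

    # ~/.boto excerpt: the parallelism settings I have been tweaking
    # [GSUtil]
    # parallel_process_count = 8
    # parallel_thread_count = 10

    # copy everything from S3 straight into GCS in parallel
    gsutil -m cp -r s3://my-source-bucket/ gs://my-dest-bucket/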

What I am now contemplating:

  • Load the data in batches from S3 into HDFS using distcp (roughly as sketched after this list) and then find a way of distcp-ing all the data into Google Storage (not supported as far as I can tell), or:

  • Set up a Hadoop cluster where each node runs a gsutil cp parallel job with S3 and GS as src and dst
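For the first option, I imagine the S3-to-HDFS batch load would look roughly like the following (bucket, path, and credential values are placeholders; I haven't verified the exact flags for my setup):

    # pull one batch (one S3 prefix) into HDFS; s3n:// was the common scheme here,
    # newer Hadoop versions use s3a:// with fs.s3a.* credential properties instead
    hadoop distcp \
        -Dfs.s3n.awsAccessKeyId=MY_ACCESS_KEY \
        -Dfs.s3n.awsSecretAccessKey=MY_SECRET_KEY \
        s3n://my-source-bucket/batch-0001/ \
        hdfs:///staging/batch-0001/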

If the first option were supported, I would really appreciate details on how to do that. However, it seems like I'm going to have to figure out how to do the second one. I'm unsure of how to pursue this avenue because I would need to keep track of the gsutil resumable transfer feature across many nodes, and I'm generally inexperienced with running this sort of Hadoop job.

Any help on how to pursue one of these avenues (or something simpler I haven't thought of) would be greatly appreciated.


2 Answers


You can set up a Google Compute Engine (GCE) account and run gsutil from GCE to import the data. You can start multiple GCE instances, each importing a subset of the data. This is one of the techniques we covered in our talk at Google I/O 2013, called Importing Large Data Sets into Google Cloud Storage.
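A rough sketch of the sharding idea (the bucket names and the prefix-based split below are placeholder assumptions): each instance copies a different slice of the keyspace, for example:

    # each GCE instance imports a different slice of the keyspace,
    # e.g. instance 0 handles object names starting with "00":
    gsutil -m cp -r "s3://my-source-bucket/00*" gs://my-dest-bucket/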

If you use this approach, another thing you'll want to do is use the gsutil cp -L and -n options. -L creates a manifest that records details about what has been transferred, and -n lets you avoid re-copying files that have already been copied (in case you restart the copy from the beginning, e.g., after an interruption). I suggest you update to gsutil version 3.30 (which will come out in the next week or so), which improves the way the -L option works for this kind of copying scenario.
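For example, a restartable invocation might look something like this (bucket names and the manifest path are placeholders):

    # -L writes a manifest of everything transferred; -n (no-clobber) skips objects
    # already present at the destination, so the same command can simply be
    # re-run after an interruption
    gsutil -m cp -L transfer-manifest.csv -n -r \
        s3://my-source-bucket/ gs://my-dest-bucket/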

Mike Schwartz, Google Cloud Storage team

Answered 2013-06-06T19:51:26.903

Google has recently released the Cloud Storage Transfer Service, which is designed to transfer large amounts of data from S3 to GCS: https://cloud.google.com/storage/transfer/getting-started
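As a rough sketch of kicking off such a transfer with a recent gcloud CLI (the bucket names and credentials file below are placeholders, and flag names may vary by gcloud version; the getting-started guide above covers the underlying REST API):

    # create a transfer job from an S3 bucket to a GCS bucket; the creds file
    # holds the AWS access key/secret used to read the source bucket
    gcloud transfer jobs create s3://my-source-bucket gs://my-dest-bucket \
        --source-creds-file=aws-creds.json

The copy runs on Google's side, so no intermediate VMs or Hadoop cluster are needed.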

(I realize this answer is a bit late for the original question, but it may help future visitors with the same problem.)

Answered 2015-07-30T20:52:43.013