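I know EC2 is more flexible but takes more work than EMR. In terms of cost, though, if I use EC2 it will probably need EBS volumes attached to the EC2 instances, whereas EMR just streams the data in from S3. So when I crunch the numbers in the AWS calculator, EMR comes out cheaper than EC2, even though with EMR you also have to pay for the underlying EC2 instances. Am I wrong here? Of course EC2 with EBS is probably faster, but is it worth the cost?
Thanks, Matt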
EMR does a lot of things for you that you won't find on standard Hadoop on EC2. Some particularly important ones include
You'll also find that the EMR S3 filesystem is faster and more reliable than the standard one packaged with Apache Hadoop. It supports multipart upload and streams writes directly to S3 rather than buffering them to disk first. For a bit more on this, see Tip #5.
Additionally, if you do decide to use EC2 directly, I'd recommend using instance-storage instead of EBS for your nodes. There's really no reason to pay the extra cost of EBS for Hadoop; you'll notice that EMR clusters all run on instance-storage nodes as well.
You are correct that EMR uses instance-store backed EC2 instances, rather than EBS. However, there's nothing stopping you from creating an instance-store based instance, packing an AMI and using it for your Hadoop cluster. Using EBS also might not represent a lot of additional costs, depending on your workload and frequency. Also, there's an added cost to the EC2 instance when using it through EMR.
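To make that trade-off concrete, here is a rough sketch of the arithmetic in Python. Every price and size below is a made-up placeholder, not a real AWS rate, and the `cluster_cost` helper is hypothetical; plug in the figures from the AWS calculator for your region and instance type. It also assumes the EBS volumes are deleted when the cluster shuts down, so storage is prorated by hours used.

```python
HOURS_PER_MONTH = 730  # common approximation for monthly proration


def cluster_cost(nodes, hours, ec2_hourly, emr_surcharge_hourly=0.0,
                 ebs_gb_per_node=0, ebs_price_gb_month=0.0):
    """Estimate the cost of running a cluster for a number of hours.

    All rates are supplied by the caller; nothing here is a real AWS price.
    """
    # Compute: each node pays the EC2 rate plus any EMR surcharge per hour.
    compute = nodes * hours * (ec2_hourly + emr_surcharge_hourly)
    # Storage: EBS is priced per GB-month, prorated for the hours in use
    # (assumes the volumes are deleted together with the cluster).
    storage = (nodes * ebs_gb_per_node * ebs_price_gb_month
               * hours / HOURS_PER_MONTH)
    return compute + storage


# Placeholder inputs (assumptions, not actual prices).
NODES, HOURS = 10, 100
EC2_HOURLY = 0.20      # hypothetical on-demand rate per node-hour
EMR_SURCHARGE = 0.05   # hypothetical EMR fee per node-hour
EBS_GB_PER_NODE = 500  # hypothetical volume size
EBS_GB_MONTH = 0.10    # hypothetical EBS price per GB-month

ec2_with_ebs = cluster_cost(NODES, HOURS, EC2_HOURLY,
                            ebs_gb_per_node=EBS_GB_PER_NODE,
                            ebs_price_gb_month=EBS_GB_MONTH)
ec2_instance_store = cluster_cost(NODES, HOURS, EC2_HOURLY)
emr = cluster_cost(NODES, HOURS, EC2_HOURLY,
                   emr_surcharge_hourly=EMR_SURCHARGE)

print(f"EC2 + EBS:            ${ec2_with_ebs:.2f}")
print(f"EC2 + instance store: ${ec2_instance_store:.2f}")
print(f"EMR:                  ${emr:.2f}")
```

With these particular placeholders, EMR lands between the two plain-EC2 options, but the ordering depends entirely on the rates and volume sizes you feed in. If the EBS volumes outlive the cluster, bill them for the full month instead of prorating.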
I've been using EMR for two years now and I would highly recommend the service as you don't need to invest time in managing and updating your distribution. If your workload is compatible with EMR (getting data from DynamoDB or S3), I would go for EMR as opposed to EC2/Hadoop.