hadoop - Why does increasing the number of maps affect bandwidth and cluster utilization in Hadoop?

Question

Recently I was reading the book Hadoop: The Definitive Guide, in the part about copying data between two clusters with distcp, and I saw this comment: "When data size is very large, it becomes necessary to limit the number of maps in order to limit bandwidth and cluster utilization"

I don't understand why. I think we should use as much bandwidth as possible to make the cluster more efficient. So why should we limit the number of maps?

score 1 · Accepted Answer

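Of course, having more mappers helps us achieve greater parallelism, but if the number is too high, it starts to become a bottleneck. For example, if you have many more mappers than the number of CPU slots available on your servers, most of the mappers will sit in a waiting state. Likewise, you might run out of memory and might face network congestion. Also, creating that many InputSplits and that many map tasks takes more time. So the number of mappers should be reasonably high: neither too high nor too low. The framework actually does this for you under normal conditions, so you don't have to worry about it. Sometimes you may want to set it yourself to suit your requirements, but keep the points above in mind.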
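To put rough numbers on the slot and bandwidth argument, here is a toy sketch in Python. It is not how Hadoop schedules anything; the function name `copy_estimate` and every figure (slot count, per-map throughput, link capacity) are made-up assumptions for illustration only.

```python
import math

def copy_estimate(num_maps, map_slots, per_map_mbps, link_mbps):
    """Toy model (hypothetical, not a Hadoop API): how many maps run at once,
    how many wait for a slot, and what share of the network link they demand."""
    running = min(num_maps, map_slots)        # maps that can hold a slot at the same time
    waiting = num_maps - running              # maps queued until a slot frees up
    waves = math.ceil(num_maps / map_slots)   # rounds of scheduling needed to run them all
    demand = running * per_map_mbps           # bandwidth the running maps try to use
    link_share = min(demand, link_mbps) / link_mbps  # fraction of the link they occupy
    return {"running": running, "waiting": waiting,
            "waves": waves, "link_share": link_share}

# Assumed cluster: 100 map slots, 10 MB/s per copy map, 1000 MB/s inter-cluster link.
print(copy_estimate(20, 100, 10, 1000))   # few maps: one wave, part of the link in use
print(copy_estimate(400, 100, 10, 1000))  # many maps: slots full, link saturated, rest wait
```

In this model, 20 maps leave most slots and most of the link free for other jobs, while 400 maps fill every slot, saturate the link, and still leave 300 maps waiting. That is the situation the book's advice is trying to avoid; with distcp the cap is set through its `-m` option.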

HTH
