2

Recently I was reading the book, hadoop: the definitive guide which the part is two clusters copy data using distcp, and I saw the comment: "When data size is very large, it becomes necessary to limit the number of maps in order to limit bandwidth and cluster utilization"

I cannot get the meaning why? I think we should utilize the bandwidth as wide as possible to increase the efficiency of cluster. So why should we limit the number of maps?

4

1 回答 1

1

当然有更多的没有。映射器的数量有助于我们实现更高的并行度,但如果太高,它就会开始成为瓶颈。例如,如果您的映射器比没有的多得多。从服务器上可用的 CPU 插槽数,大多数映射器将处于等待状态。同样,您可能会耗尽内存并可能面临网络拥塞。此外,创建这么多 InputSplit 和创建这么多地图需要更多时间。因此,映射器的数量应该相当高。不会太高,也不会太低。实际上框架在正常情况下会为您执行此操作,因此您不必担心。但有时您可能想根据自己的要求自行完成,但请记住上述事项。

高温高压

于 2013-05-09T13:23:13.890 回答