0

如何设置Hadoop中DataNode的数量?是由代码、配置还是环境决定的。同样在浏览文章时,有人说“每个节点的首选地图数量约为 10-100 个地图”,所以这里的“节点”是指 NameNode 还是 DataNode?

而说到MapTasks的个数,有的说等于split的个数,有的说是block的个数,还有的说是框架决定的,可能没有给出确切的split或者block的个数,所以就从他们那里?

4

1 回答 1

3

Question : How to set the number of DataNodes in Hadoop?

For setting or calculating the number of DataNodes. Firstly estimate the Hadoop Storage (H) :

H=crS/(1-i)

where:

c = average compression ratio. It depends on the type of compression used (Snappy, LZOP, ...) and size of the data. When no compression is used, c=1.

r = replication factor. It is usually 3 in a production cluster.

S = size of data to be moved to Hadoop. This could be a combination of historical data and incremental data. The incremental data can be daily for example and projected over a period of time (3 years for example).

i = intermediate factor. It is usually 1/3 or 1/4. Hadoop's working space dedicated to storing intermediate results of Map phases.

Example: With no compression i.e. c=1, a replication factor of 3, an intermediate factor of .25=1/4

H= 1*3*S/(1-1/4)=3*S/(3/4)=4*S

With the assumptions above, the Hadoop storage is estimated to be 4 times the size of the initial data size.

Now the formula to estimate the number of Data nodes (n):

n= H/d = crS/(1-i)*d

where:

d = disk space available per node.

Question : "The preferred number of maps around 10-100 maps per-node" so "node" here means NameNode or DataNode?

As you know that MapReduce jobs go to the data for the processing but vice-versa is not true. So here "node" is Data Node.

Question : How Many Maps?

The number of maps is usually driven by the total size of the inputs, that is, the total number of blocks of the input files.

The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very cpu-light map tasks. Task setup takes a while, so it is best if the maps take at least a minute to execute.

If you havve 10TB of input data and a blocksize of 128MB, you’ll end up with 82,000 maps, unless Configuration.set(MRJobConfig.NUM_MAPS, int) (which only provides a hint to the framework) is used to set it even higher.

于 2016-11-29T09:32:24.510 回答