
Let's assume that we have 3 mappers (m1, m2 and m3) and 2 reducers (r1 and r2).

Each reducer fetches its input partitions from the files generated by each mapper.

From the job history, I can extract the total input for each reduce task, but I would like to know how much each mapper contributes to that reducer's input.

For example, reducer r1 will receive an input INPUT_r1 such that:

INPUT_r1 = ( partition fetched from m1 ) + ( partition fetched from m2 ) + ( partition fetched from m3 )

How can I find the size of each of those partitions, per mapper?


1 Answer


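To find the size of the partitions coming from the mappers, there are a few things to take into account.

First, we should understand that in Hadoop the partitioner runs before the combiner, so if your logic includes a combiner you will need to take it into account, in case it affects your attempt to find the sizes. This matters if the sizes you observe differ from what I suggest here.

Second, the default partitioner, HashPartitioner, assigns roughly the same number of keys to each reducer. The method it uses is: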

public int getPartition(K2 key, V2 value, int numReduceTasks) {
    // Clear the sign bit so the hash is non-negative, then take it modulo the reducer count.
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}

Note that the partitioner considers only the key and ignores the value, which can leave the data distributed to the reducers unevenly balanced.
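To make the formula concrete, here is a minimal standalone sketch of the same arithmetic in plain Java, without any Hadoop classes. The class name `HashPartitionDemo` and the helper `partitionFor` are made up for illustration; only the expression inside the helper comes from the method quoted above.

```java
class HashPartitionDemo {
    // Hypothetical helper: same arithmetic as the quoted getPartition,
    // minus the unused value parameter.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int reducers = 2;

        // The partition depends only on the key's hash code.
        System.out.println("a -> " + partitionFor("a", reducers));
        System.out.println("b -> " + partitionFor("b", reducers));

        // Negative hash codes still map to a valid index, because the mask
        // clears the sign bit before the modulo is taken.
        System.out.println("-7 -> " + partitionFor(Integer.valueOf(-7), reducers));
        System.out.println("MIN_VALUE -> " + partitionFor(Integer.MIN_VALUE, reducers));
    }
}
```

Since the value never enters the computation, a key with a few very large values and a key with many tiny values are treated identically, which is why an even split of keys does not imply an even split of bytes.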

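The way I would find the sizes is to put a counter next to your HashPartitioner or custom partitioner and have it account for the size of each key-value pair it collects. Then print this value for each partitioner. You will probably need to keep track of where each partition sends its data, because the partitioner itself does not know who it is sending data to.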
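The bookkeeping described here can be sketched in plain Java as a simulation, not as Hadoop code. Everything in it is an assumption for illustration: the class `PartitionSizeTally`, its methods, the use of String keys and values, and measuring size as raw UTF-8 bytes. Real map output also carries serialization overhead and may be compressed or shrunk by a combiner, so actual shuffle sizes will differ.

```java
import java.nio.charset.StandardCharsets;

class PartitionSizeTally {
    // bytes[mapper][reducer] = bytes that mapper emitted for that reducer
    private final long[][] bytes;

    PartitionSizeTally(int numMappers, int numReducers) {
        bytes = new long[numMappers][numReducers];
    }

    // Same arithmetic as the default hash partitioning quoted earlier.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Call once per emitted pair; charges key + value bytes to the target partition.
    void record(int mapper, String key, String value) {
        int reducer = partitionFor(key, bytes[mapper].length);
        bytes[mapper][reducer] += key.getBytes(StandardCharsets.UTF_8).length
                                + value.getBytes(StandardCharsets.UTF_8).length;
    }

    // Contribution of one mapper to one reducer's input.
    long contribution(int mapper, int reducer) {
        return bytes[mapper][reducer];
    }

    // Total input of a reducer: the sum of the contributions of all mappers.
    long reducerInput(int reducer) {
        long total = 0;
        for (long[] perMapper : bytes) {
            total += perMapper[reducer];
        }
        return total;
    }

    public static void main(String[] args) {
        PartitionSizeTally tally = new PartitionSizeTally(3, 2); // m1..m3, r1..r2
        tally.record(0, "a", "xxxx");
        tally.record(1, "a", "yy");
        tally.record(1, "b", "z");
        tally.record(2, "a", "wwwwww");
        for (int r = 0; r < 2; r++) {
            for (int m = 0; m < 3; m++) {
                System.out.println("m" + (m + 1) + " -> r" + (r + 1) + ": "
                        + tally.contribution(m, r) + " bytes");
            }
            System.out.println("INPUT_r" + (r + 1) + " = " + tally.reducerInput(r) + " bytes");
        }
    }
}
```

In a real job the same tally would have to live inside each map task, for example keyed by partition number and reported once at the end of the task, since each mapper only sees its own output.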

The MapReduce Book cites a good deal of research on this problem.

answered 2013-04-09T21:01:56.470