hadoop - Size of map output partitions?

Question

Let's assume that we have 3 mappers (m1, m2 and m3) and 2 reducers (r1 and r2).

Each reducer fetches its input partitions from the files generated by each mapper.

From the job history, I can extract the total input for each reduce task, but I would like to know the contribution of each mapper to this reducer input.

For example, the reducer r1 will receive an INPUT_r1 such as:

INPUT_r1 = ( partition fetched from m1 ) + ( partition fetched from m2 ) + ( partition fetched from m3 )

I would like to know the size of each of those partitions from the mappers.

score 0 · Accepted Answer

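To find the size of the partitions coming from the mappers, there are a few things to consider.

First, we should understand that in Hadoop the partitioner runs before the combiner, so if your logic includes a combiner you will need to account for it, if it affects your attempt to find the sizes. This matters if the sizes you observe differ from what the approach suggested here predicts.

Second, the default partitioner, HashPartitioner, assigns roughly the same number of keys to each reducer. The method it uses is: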

public int getPartition(K2 key, V2 value, int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
}
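
To make the arithmetic above concrete, here is a minimal standalone sketch that applies the same expression outside Hadoop. The class name HashPartitionSketch and the sample keys are hypothetical, and the keys are plain java.lang.String or Integer objects; Hadoop key types such as Text compute hashCode differently, so the partition numbers in a real job will not match these.

```java
public class HashPartitionSketch {
    // Same arithmetic as the getPartition method quoted above.
    // The mask clears the sign bit, so the result is never negative.
    public static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 2; // r1 and r2 from the question
        for (String key : new String[] {"a", "b", "c", "d"}) {
            // Partition 0 corresponds to r1, partition 1 to r2.
            System.out.println(key + " -> r" + (partitionFor(key, numReduceTasks) + 1));
        }
        // A key with a negative hashCode still lands in a valid partition.
        System.out.println("-1 -> r" + (partitionFor(Integer.valueOf(-1), numReduceTasks) + 1));
    }
}
```

Without the mask, a negative hashCode would give a negative remainder and therefore an invalid partition index.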

Note that the partitioner considers only the key and ignores the value, which can lead to data being unevenly distributed across the reducers.

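The way I would find the sizes is to set up a counter around your HashPartitioner or custom partitioner and account for the size of each key-value pair it collects. Then print this value for each partitioner. You may need to track where each partition sends its data, since the partitioners themselves do not know who they are sending data to.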
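
The counting idea can be sketched in plain Java as a per-mapper tally with one slot per partition. This is only an illustration under stated assumptions: the class PartitionSizeSketch and its methods are hypothetical, keys and values are plain strings measured by their UTF-8 length, and the totals ignore Hadoop's serialization overhead, compression, and any combiner, so they approximate rather than reproduce the bytes a reducer actually fetches.

```java
import java.nio.charset.StandardCharsets;
import java.util.Arrays;

public class PartitionSizeSketch {
    // One running byte total per partition (that is, per reducer).
    private final long[] bytesPerPartition;

    public PartitionSizeSketch(int numReduceTasks) {
        bytesPerPartition = new long[numReduceTasks];
    }

    // Same arithmetic as HashPartitioner.getPartition.
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    // Record one emitted pair: add its raw UTF-8 size to its partition's total.
    public void collect(String key, String value) {
        int p = partitionFor(key, bytesPerPartition.length);
        bytesPerPartition[p] += key.getBytes(StandardCharsets.UTF_8).length
                + value.getBytes(StandardCharsets.UTF_8).length;
    }

    public long[] sizes() {
        return bytesPerPartition.clone();
    }

    public static void main(String[] args) {
        // One tally per mapper; here only m1 is simulated.
        PartitionSizeSketch m1 = new PartitionSizeSketch(2);
        m1.collect("a", "xx");   // partition 1, 3 bytes
        m1.collect("b", "yyyy"); // partition 0, 5 bytes
        m1.collect("c", "zzz");  // partition 1, 4 bytes
        System.out.println("m1 -> " + Arrays.toString(m1.sizes())); // m1 -> [5, 7]
    }
}
```

Keeping one such tally per map task and printing it when the task finishes would give the per-mapper terms in the INPUT_r1 sum from the question, with index 0 being the share destined for r1 and index 1 the share for r2.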

The MapReduce Book cites a good deal of research on this problem.
