hadoop - Difference between a combiner and a partitioner

Question

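I am new to MapReduce and I just cannot figure out the difference between a partitioner and a combiner. I know that both run in the intermediate step between the map and reduce tasks, and that both reduce the amount of data the reduce tasks have to process. Please explain the difference with an example.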

score 15 · Accepted Answer

First of all, I agree with @Binary nerd's comment.

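A Combiner can be viewed as a mini-reducer in the map phase. It performs a local reduce on the mapper output before that output is distributed further. Once the Combiner has run, its output is passed on to the Reducer for further work.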

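The Partitioner comes into the picture when we are working with more than one Reducer. The partitioner decides which reducer is responsible for a particular key. It takes the Mapper result (or the Combiner result, if a Combiner is used) and sends it to the responsible Reducer based on the key.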
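To make the routing step concrete, here is a minimal plain-Java sketch with no Hadoop dependency. The class and method names (`HashPartitionSketch`, `partitionFor`, `route`) are made up for illustration. The formula is the usual hash-then-modulo scheme that hash partitioners use; treat it as an assumption about the general idea rather than as the exact code your Hadoop version runs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: hash-based routing of keys to reducers.
class HashPartitionSketch {

    // Maps a key to a reducer index in [0, numReducers).
    // Masking with Integer.MAX_VALUE clears the sign bit, so the result is never negative.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Groups keys by the reducer index they are routed to.
    static Map<Integer, List<String>> route(List<String> keys, int numReducers) {
        Map<Integer, List<String>> byReducer = new TreeMap<>();
        for (String key : keys) {
            int reducer = partitionFor(key, numReducers);
            byReducer.computeIfAbsent(reducer, r -> new ArrayList<>()).add(key);
        }
        return byReducer;
    }

    public static void main(String[] args) {
        List<String> keys = Arrays.asList("hello", "there", "hello", "howdy", "again", "howdy");
        // Every occurrence of the same key lands on the same reducer.
        System.out.println(route(keys, 2));
    }
}
```

The point to notice is that the partitioner never changes or shrinks the data. It only decides where each key goes, and equal keys always go to the same reducer.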

Scenario with both a Combiner and a Partitioner:

Scenario with a Partitioner only:

Examples:

Combiner example
Partitioner example:

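The partitioning phase takes place after the map phase and before the reduce phase. The number of partitions is equal to the number of reducers. The data gets partitioned across the reducers according to the partitioning function. The difference between a partitioner and a combiner is that the partitioner divides the data according to the number of reducers, so that all the data in a single partition is processed by a single reducer. The combiner, however, functions similarly to the reducer and processes the data in each partition. The combiner is an optimization of the reducer. The default partitioning function is the hash partitioning function, where the hashing is done on the key. However, it might be useful to partition the data according to some other function of the key or the value. -- Source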
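As an illustration of partitioning by "some other function of the key", the following plain-Java sketch routes keys by their first letter instead of by hash. The rule (a..m versus everything else) and the names `FirstLetterPartitionSketch` and `partitionFor` are hypothetical, chosen only for this example.

```java
// Illustrative sketch only: a custom partitioning rule based on the key's first letter.
class FirstLetterPartitionSketch {

    // Hypothetical rule: keys starting with a..m go to partition 0,
    // all other keys go to the last partition.
    static int partitionFor(String key, int numReducers) {
        if (numReducers <= 1 || key.isEmpty()) {
            return 0; // with a single reducer there is only one possible partition
        }
        char first = Character.toLowerCase(key.charAt(0));
        return (first >= 'a' && first <= 'm') ? 0 : numReducers - 1;
    }

    public static void main(String[] args) {
        for (String key : new String[] {"again", "hello", "howdy", "there"}) {
            System.out.println(key + " -> reducer " + partitionFor(key, 2));
        }
    }
}
```

In a real Hadoop job the same logic would typically live in a subclass of `org.apache.hadoop.mapreduce.Partitioner` that overrides `getPartition(key, value, numPartitions)` and is registered with `job.setPartitionerClass(...)`.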

score 8 · Accepted Answer

I think a small example can explain this very clearly and quickly.

Let's say you have a MapReduce Word Count job with 2 mappers and 1 reducer.

Without a combiner.

"hello hello there" => mapper1 => (hello, 1), (hello, 1), (there, 1)

"howdy howdy again" => mapper2 => (howdy, 1), (howdy, 1), (again, 1)

Both outputs reach the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)

Using the Reducer as the combiner

"hello hello there" => mapper1 with combiner => (hello, 2), (there, 1)

"howdy howdy again" => mapper2 with combiner => (howdy, 2), (again, 1)

Both outputs reach the reducer => (again, 1), (hello, 2), (howdy, 2), (there, 1)

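Conclusion

The end result is the same, but when using a combiner the map output is already reduced. In this example each mapper sends only 2 output pairs to the reducer instead of 3, so you gain IO/disk performance. This is useful when aggregating values.

The Combiner is effectively a Reducer applied to the map() outputs.

If you look at the very first Apache MapReduce tutorial, which happens to be exactly the mapreduce example I just illustrated, you can see that they use the reducer as the combiner: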
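The word count walk-through can be reproduced with a small plain-Java simulation. This is only a sketch without any Hadoop classes, and the names `CombinerSketch`, `map`, `sum` and `shuffleInput` are invented for illustration. It counts how many pairs would be shipped to the reducer with and without a combiner.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Illustrative sketch only: word count with and without a combiner.
class CombinerSketch {

    // Map step: emit (word, 1) for every whitespace-separated word in the line.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        for (String word : line.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                out.add(new SimpleEntry<>(word, 1));
            }
        }
        return out;
    }

    // Sums the values per key. The same function serves as the combiner
    // (applied per mapper) and as the reducer (applied to everything).
    static List<Map.Entry<String, Integer>> sum(List<Map.Entry<String, Integer>> pairs) {
        Map<String, Integer> totals = new TreeMap<>();
        for (Map.Entry<String, Integer> pair : pairs) {
            totals.merge(pair.getKey(), pair.getValue(), Integer::sum);
        }
        return new ArrayList<>(totals.entrySet());
    }

    // Collects the pairs that would be sent to the reducer, one mapper per input line.
    static List<Map.Entry<String, Integer>> shuffleInput(List<String> lines, boolean useCombiner) {
        List<Map.Entry<String, Integer>> sent = new ArrayList<>();
        for (String line : lines) {
            List<Map.Entry<String, Integer>> mapped = map(line);
            sent.addAll(useCombiner ? sum(mapped) : mapped);
        }
        return sent;
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList("hello hello there", "howdy howdy again");
        for (boolean useCombiner : new boolean[] {false, true}) {
            List<Map.Entry<String, Integer>> sent = shuffleInput(lines, useCombiner);
            System.out.println("combiner=" + useCombiner
                    + " pairs sent=" + sent.size()
                    + " result=" + sum(sent));
        }
    }
}
```

Across both mappers, 6 pairs reach the reducer without the combiner and 4 with it, while the reduced result stays the same. Reusing the reducer as the combiner works here because summation is associative and commutative.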

job.setCombinerClass(IntSumReducer.class);
job.setReducerClass(IntSumReducer.class);
