hadoop - Why can a reducer in Hadoop map/reduce have different input and output keys and values?

Question

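Because of the nature of Map/Reduce applications, the reduce function may be called more than once for the same key, so in MongoDB's Map/Reduce implementation the input and output key/value must be of the same form. I wonder why this is different in the Hadoop implementation. (Or, to put it better, why Hadoop allows them to differ.)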

org.apache.hadoop.mapreduce.Reducer<KEYIN,VALUEIN,KEYOUT,VALUEOUT>

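Second question: how does Hadoop know whether the output of the reduce function should be passed back to reduce on the next run or written to HDFS? For example: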

public class MyReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    public void reduce(Text key, Iterable<IntWritable> values, Context context)
            throws IOException, InterruptedException {
        for (IntWritable value : values) {
            // Will this key/value be passed back to reduce on the next run, or written to HDFS?
            context.write(key, value);
        }
    }
}

score 2 · Accepted Answer

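Consider an example where the input is document names (as keys) and document lines (as values), and the result is the STDDEV (standard deviation) of the line lengths.
To generalize: the type of the aggregate does not have to match the type of the input data, so Hadoop leaves that freedom to the developer.
As for your second question: Hadoop has no mechanism similar to MongoDB's incremental MapReduce, so the reducer's results are always saved to HDFS (or another DFS) and are never passed back to reduce.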
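
The standard deviation example above can be sketched in plain Java to make the type difference concrete. This is only an illustration under stated assumptions: `SimpleReducer`, `LineLengthStdDev`, `StdDevSketch` and `run` are hypothetical names invented here, and `SimpleReducer` is a local stand-in that mirrors the four type parameters of `org.apache.hadoop.mapreduce.Reducer`, not the Hadoop class itself. The output list stands in for the job output on HDFS: reduce is called once for the key, and what it emits is collected, not fed back into reduce.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;

public class StdDevSketch {

    // Hypothetical stand-in that mirrors the four type parameters of Hadoop's
    // Reducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT>. It is NOT the Hadoop class.
    interface SimpleReducer<KEYIN, VALUEIN, KEYOUT, VALUEOUT> {
        void reduce(KEYIN key, Iterable<VALUEIN> values,
                    List<Map.Entry<KEYOUT, VALUEOUT>> output);
    }

    // Input: (document name, lines of text).
    // Output: (document name, standard deviation of the line lengths).
    // The input value type (String) differs from the output value type (Double).
    static class LineLengthStdDev
            implements SimpleReducer<String, String, String, Double> {
        @Override
        public void reduce(String docName, Iterable<String> lines,
                           List<Map.Entry<String, Double>> output) {
            long count = 0;
            double sum = 0.0;
            double sumOfSquares = 0.0;
            for (String line : lines) {
                double length = line.length();
                count++;
                sum += length;
                sumOfSquares += length * length;
            }
            if (count == 0) {
                return; // nothing to aggregate, emit nothing
            }
            double mean = sum / count;
            // Population variance; clamped at 0 to guard against rounding error.
            double variance = Math.max(0.0, sumOfSquares / count - mean * mean);
            output.add(new AbstractMap.SimpleEntry<>(docName, Math.sqrt(variance)));
        }
    }

    // Calls reduce exactly once for the key and returns what it emitted.
    // The returned list plays the role of the final job output.
    public static List<Map.Entry<String, Double>> run(String docName, List<String> lines) {
        List<Map.Entry<String, Double>> output = new ArrayList<>();
        new LineLengthStdDev().reduce(docName, lines, output);
        return output;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Double>> out = run("doc1.txt", Arrays.asList("ab", "abcd"));
        for (Map.Entry<String, Double> record : out) {
            System.out.println(record.getKey() + "\t" + record.getValue());
        }
    }
}
```

In an actual Hadoop job the analogous declaration would be along the lines of `Reducer<Text, Text, Text, DoubleWritable>`. A `Double` result could not be passed back in as a `String` line, which is consistent with the answer's point that reducer output is not re-reduced.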
