hadoop - How does the OutputCollector work?

Question

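I am trying to analyze the default MapReduce job, the one that defines neither a mapper nor a reducer, i.e. one that uses IdentityMapper and IdentityReducer. To make things clear to myself, I just wrote my own identity reducer: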

public static class MyIdentityReducer extends MapReduceBase
        implements Reducer<Text, Text, Text, Text> {
    @Override
    public void reduce(Text key, Iterator<Text> values,
            OutputCollector<Text, Text> output, Reporter reporter)
            throws IOException {
        // Emit every value for this key unchanged: one output record per value.
        while (values.hasNext()) {
            Text value = values.next();
            output.collect(key, value);
        }
    }
}

My input file is:

$ hadoop fs -cat NameAddress.txt
Dravid Banglore
Sachin Mumbai
Dhoni Ranchi
Dravid Jaipur
Dhoni Chennai
Sehwag Delhi
Gambhir Delhi
Gambhir Calcutta

I was expecting:
Dravid Jaipur
Dhoni Chennai
Gambhir Calcutta
Sachin Mumbai
Sehwag Delhi

I got:
$ hadoop fs -cat NameAddress/part-00000
Dhoni   Ranchi
Dhoni   Chennai
Dravid  Banglore
Dravid  Jaipur
Gambhir Delhi
Gambhir Calcutta
Sachin  Mumbai
Sehwag  Delhi

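My understanding was that aggregation is done by the programmer in the reducer's while loop, and the result is then written to the OutputCollector. I was under the impression that the keys the reducer passes to the OutputCollector are always unique, so that if I don't aggregate, the value written last for a key would overwrite the previous one. Clearly that is not the case. Can someone give me a better explanation of the OutputCollector: how it works and how it handles all the keys? I see many implementations of OutputCollector in the Hadoop source code. Can I write my own OutputCollector that does what I was expecting?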

score 1 · Accepted Answer

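Keys are unique per reduce call: each call to the reducer receives one unique key and an iterator over all the values associated with that key. What you are doing is iterating over all the values passed in and writing each one out.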

So it doesn't matter that, in your case, there may be fewer reduce calls than input records. You still end up writing out every value.
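To make the point concrete, here is a minimal plain-Java sketch of that behavior. It is not Hadoop code: `ReduceSketch`, `Collector`, `Reducer`, and `run` are hypothetical stand-ins, and the grouping step only approximates what the framework does between map and reduce. It assumes each `collect` call simply appends one record, which is consistent with the output shown in the question. The `LAST_VALUE` reducer shows one way to get the expected result by changing the reducer rather than the collector.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java sketch; Collector, Reducer and ReduceSketch are hypothetical
// stand-ins for illustration, not Hadoop classes.
class ReduceSketch {

    /** Stand-in for OutputCollector: every collect() call appends one record. */
    interface Collector {
        void collect(String key, String value);
    }

    /** Stand-in for the old-API Reducer: called once per distinct key. */
    interface Reducer {
        void reduce(String key, Iterator<String> values, Collector output);
    }

    /** Identity behavior from the question: emits one record per value. */
    static final Reducer IDENTITY = (key, values, output) -> {
        while (values.hasNext()) {
            output.collect(key, values.next());
        }
    };

    /** Keeps only the last value seen for each key and emits it once. */
    static final Reducer LAST_VALUE = (key, values, output) -> {
        String last = null;
        while (values.hasNext()) {
            last = values.next();
        }
        if (last != null) {
            output.collect(key, last);
        }
    };

    /** Groups records by key in sorted key order, then calls reduce once per key. */
    static List<String> run(String[][] records, Reducer reducer) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (String[] record : records) {
            grouped.computeIfAbsent(record[0], k -> new ArrayList<>()).add(record[1]);
        }
        List<String> written = new ArrayList<>();
        Collector output = (key, value) -> written.add(key + "\t" + value);
        for (Map.Entry<String, List<String>> entry : grouped.entrySet()) {
            reducer.reduce(entry.getKey(), entry.getValue().iterator(), output);
        }
        return written;
    }

    /** The records from NameAddress.txt in the question. */
    static final String[][] INPUT = {
        {"Dravid", "Banglore"}, {"Sachin", "Mumbai"}, {"Dhoni", "Ranchi"},
        {"Dravid", "Jaipur"}, {"Dhoni", "Chennai"}, {"Sehwag", "Delhi"},
        {"Gambhir", "Delhi"}, {"Gambhir", "Calcutta"}
    };

    public static void main(String[] args) {
        // Identity reducer: 5 reduce calls, but 8 records written.
        run(INPUT, IDENTITY).forEach(System.out::println);
        System.out.println("---");
        // Last-value reducer: 5 reduce calls, 5 records written.
        run(INPUT, LAST_VALUE).forEach(System.out::println);
    }
}
```

In a real job the same idea would go inside your `reduce` method: loop over `values`, remember one, and call `output.collect` once after the loop. Note that Hadoop does not guarantee the order in which values arrive for a key unless you set up a secondary sort, so "last" is only well defined in this sketch.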
