java - WordCount MapReduce gives unexpected results

Question

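I am trying this Java code in MapReduce to compute a word count. After the reduce method has finished, I want to output only the single word that occurs the most times.

To do this, I created some class-level variables named myoutput, mykey and completeSum.

I write this data out in the close method, but in the end I get an unexpected result.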

public class WordCount {

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);

        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            output.collect(word, one);
        }

    }
}

static int completeSum = -1;
static OutputCollector<Text, IntWritable> myoutput;
static Text mykey = new Text();

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {

    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }

        if (completeSum < sum) {
            completeSum = sum;
            myoutput = output;
            mykey = key;
        }


    }

    @Override
    public void close() throws IOException {
        // TODO Auto-generated method stub
        super.close();
        myoutput.collect(mykey, new IntWritable(completeSum));
    }
}

public static void main(String[] args) throws Exception {

    JobConf conf = new JobConf(WordCount.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    // conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, new Path(args[0]));
    FileOutputFormat.setOutputPath(conf, new Path(args[1]));

    JobClient.runJob(conf);

}
}

Input file data

one 
three three three
four four four four 
 six six six six six six six six six six six six six six six six six six 
five five five five five 
seven seven seven seven seven seven seven seven seven seven seven seven seven

The result should be

six 18

But I get this result

three 18

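From the result, I can see that the sum is correct, but the key is not.

If anyone can point me to a good reference on these map and reduce methods, that would be very helpful.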

score 1 · Accepted Answer

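The problem you are observing is caused by reference aliasing. The object referenced by key is reused with new contents across multiple calls to reduce, which changes the contents of mykey, because mykey references the same object. It ends up holding the last reduced key (here three, the last key in sorted order). This can be avoided by copying the object, as in: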

mykey = new Text(key);
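
The aliasing effect can be reproduced without Hadoop. The following is a minimal sketch, not Hadoop code: MutableKey is a hypothetical stand-in for a mutable, reused key object such as Text, and the word counts are the ones from the question's input, visited in sorted key order. It contrasts storing the reference with storing a copy.

```java
import java.util.Arrays;
import java.util.List;

public class AliasingDemo {
    // Hypothetical stand-in for a mutable key type: one instance whose
    // contents are overwritten before every call, as described above.
    static final class MutableKey {
        private String value = "";
        MutableKey() {}
        MutableKey(MutableKey other) { this.value = other.value; } // copy constructor
        void set(String v) { this.value = v; }
        @Override public String toString() { return value; }
    }

    /** Returns {key seen through the stored reference, key seen through the copy}. */
    static String[] findMax(List<String> keys, List<Integer> counts) {
        MutableKey reused = new MutableKey(); // the single reused key object
        MutableKey aliased = null;
        MutableKey copied = null;
        int best = -1;
        for (int i = 0; i < keys.size(); i++) {
            reused.set(keys.get(i));          // new contents, same object
            if (counts.get(i) > best) {
                best = counts.get(i);
                aliased = reused;                 // stores the reference only
                copied = new MutableKey(reused);  // stores a snapshot
            }
        }
        return new String[] { aliased.toString(), copied.toString() };
    }

    public static void main(String[] args) {
        // Keys in sorted order, with the counts from the question's input.
        List<String> keys = Arrays.asList("five", "four", "one", "seven", "six", "three");
        List<Integer> counts = Arrays.asList(5, 4, 1, 13, 18, 3);
        String[] r = findMax(keys, counts);
        System.out.println("aliased: " + r[0] + ", copied: " + r[1]);
        // prints "aliased: three, copied: six"
    }
}
```

The stored reference reports whatever the reused object holds last, while the copy keeps the key that was current when the maximum was found.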

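However, you should take the result only from the output files, because static variables cannot be shared between different nodes in a distributed cluster. The approach only works in standalone mode, which defeats the purpose of map-reduce.

Finally, using global variables, even in standalone mode, is likely to cause races if parallel local tasks are used (see MAPREDUCE-1367 and MAPREDUCE-434).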
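
As a rough plain-Java analogy of the alternative (not Hadoop code), the sketch below gives each task its own instance fields instead of shared statics and derives the global maximum only from what the tasks return. MaxTask, globalMax and the way the words are split into two partitions are assumptions made for illustration.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

public class LocalMaxDemo {
    /** Hypothetical stand-in for one task that sees only its own partition. */
    static final class MaxTask implements Callable<Map.Entry<String, Integer>> {
        private final Map<String, Integer> partition;
        // Instance fields, not static: each task owns its running maximum.
        private String bestWord = null;
        private int bestCount = -1;

        MaxTask(Map<String, Integer> partition) { this.partition = partition; }

        @Override
        public Map.Entry<String, Integer> call() {
            for (Map.Entry<String, Integer> e : partition.entrySet()) {
                if (e.getValue() > bestCount) {
                    bestCount = e.getValue();
                    bestWord = e.getKey(); // String is immutable, no copy needed
                }
            }
            return new SimpleEntry<>(bestWord, bestCount);
        }
    }

    /** Runs one task per partition in parallel and merges their outputs. */
    static Map.Entry<String, Integer> globalMax(List<Map<String, Integer>> partitions)
            throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, partitions.size()));
        try {
            List<Future<Map.Entry<String, Integer>>> futures = new ArrayList<>();
            for (Map<String, Integer> p : partitions) {
                futures.add(pool.submit(new MaxTask(p)));
            }
            Map.Entry<String, Integer> best = new SimpleEntry<>(null, -1);
            for (Future<Map.Entry<String, Integer>> f : futures) {
                Map.Entry<String, Integer> local = f.get(); // read the task's output
                if (local.getValue() > best.getValue()) {
                    best = local;
                }
            }
            return best;
        } finally {
            pool.shutdown();
        }
    }

    public static void main(String[] args) throws Exception {
        Map<String, Integer> p1 = new HashMap<>();
        p1.put("one", 1); p1.put("six", 18); p1.put("five", 5);
        Map<String, Integer> p2 = new HashMap<>();
        p2.put("three", 3); p2.put("four", 4); p2.put("seven", 13);
        List<Map<String, Integer>> partitions = new ArrayList<>();
        partitions.add(p1);
        partitions.add(p2);
        Map.Entry<String, Integer> best = globalMax(partitions);
        System.out.println(best.getKey() + " " + best.getValue()); // prints "six 18"
    }
}
```

Because no state is shared between tasks, the result does not depend on how the tasks are scheduled or on whether they run in the same process.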
