java - How do I sort the Hadoop MapReduce WordCount word list by highest frequency first?

Question

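Hi, I am new to Hadoop MapReduce.

Could anyone help me modify the code posted below so that it produces the desired output?

I have the following input file

Input: Hi my name is John.Im doing my engineering.My parents stay at California

The output I get is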

Hi    1
my   3
name  1
is    1
is 1
John 1
doing  1
engineering 1
parents  1
stay  1
at  1
California   1

But I want the output sorted as

 my   3
 Hi   1 
 etc.....

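followed by all the other words. The idea is that the words repeated most often should be listed first.

I am running this job on a single node. I run the job with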

        $ hadoop jar job.jar input output

I have started the daemons with

        $ hadoop namenode -format
        $ hadoop namenode

        $ hadoop datanode
        sbin$ ./yarn-daemon.sh start resourcemanager 
        sbin$ ./yarn-daemon.sh start nodemanager

I am running hadoop-2.0.0-cdh4.0.0

        package org.apache.hadoop.examples;

        import java.io.IOException;
        import java.util.StringTokenizer;
        import org.apache.commons.logging.Log;
        import org.apache.commons.logging.LogFactory;

        import org.apache.hadoop.conf.Configuration;
        import org.apache.hadoop.io.IntWritable;
        import org.apache.hadoop.fs.Path;
        import org.apache.hadoop.io.Text;
        import org.apache.hadoop.mapreduce.Job;
        import org.apache.hadoop.mapreduce.Mapper;
        import org.apache.hadoop.mapreduce.Reducer;
        import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
        import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
        import org.apache.hadoop.util.GenericOptionsParser;

        public class WordCount {
        private static final Log LOG = LogFactory.getLog(WordCount.class);

          public static class TokenizerMapper
               extends Mapper<Object, Text, Text, IntWritable>{

            private final static IntWritable one = new IntWritable(1);
            private Text word = new Text();

            public void map(Object key, Text value, Context context
                            ) throws IOException, InterruptedException {
              StringTokenizer itr = new StringTokenizer(value.toString());
              while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
              }
            }
          }

          public static class IntSumReducer
               extends Reducer<Text,IntWritable,Text,IntWritable> {
            private IntWritable result = new IntWritable();

            public void reduce(Text key, Iterable<IntWritable> values,
                               Context context
                               ) throws IOException, InterruptedException {
              int sum = 0;
              //printKeyAndValues(key, values);

              for (IntWritable val : values) {
                sum += val.get();
                LOG.info("val = " + val.get());
              }
              LOG.info("sum = " + sum + " key = " + key);
              result.set(sum);
              context.write(key, result);
              //System.err.println(String.format("[reduce] word: (%s), count: (%d)", key, result.get()));
            }


          // a little method to print debug output
            private void printKeyAndValues(Text key, Iterable<IntWritable> values)
            {
              StringBuilder sb = new StringBuilder();
              for (IntWritable val : values)
              {
                sb.append(val.get() + ", ");
              }
              System.err.println(String.format("[reduce] key: (%s), value: (%s)", key, sb.toString()));
            }
          }

          public static void main(String[] args) throws Exception {
            Configuration conf = new Configuration();
            String[] otherArgs = new GenericOptionsParser(conf, args).getRemainingArgs();
            if (otherArgs.length != 2) {
              System.err.println("Usage: wordcount <in> <out>");
              System.exit(2);
            }
            Job job = new Job(conf, "word count");
            job.setJarByClass(WordCount.class);
            job.setMapperClass(TokenizerMapper.class);
            job.setCombinerClass(IntSumReducer.class);
            job.setReducerClass(IntSumReducer.class);
            job.setOutputKeyClass(Text.class);
            job.setOutputValueClass(IntWritable.class);
            FileInputFormat.addInputPath(job, new Path(otherArgs[0]));
            FileOutputFormat.setOutputPath(job, new Path(otherArgs[1]));

            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
        }

It would be great if anyone could help me work this out.

score 2 · Accepted Answer

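How about decrementing the count every time you find a word? Starting from 0, you will end up with negative counts, and the highest count should then come first. Note that MapReduce sorts by key, not by value, so this only takes effect once the (negated) count is the key, for example in a second pass over the WordCount output that swaps word and count.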
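The negated-count idea in this answer can be illustrated without a cluster. The following is only a minimal plain-Java sketch, not the Hadoop job itself: a `TreeMap`, which keeps its keys in ascending order, stands in for the shuffle sort, and the class name `FrequencySortSketch`, the method names, and the sample sentence are all made up for illustration. In a real second job the mapper would presumably emit the negated count as an `IntWritable` key and the word as the value; that part is an assumption and is not shown here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Hypothetical helper class, not part of Hadoop or of the question's code.
class FrequencySortSketch {

    // Pass 1: the same counting WordCount does (split on whitespace, sum per word).
    static Map<String, Integer> countWords(String text) {
        Map<String, Integer> counts = new HashMap<>();
        StringTokenizer itr = new StringTokenizer(text);
        while (itr.hasMoreTokens()) {
            counts.merge(itr.nextToken(), 1, Integer::sum);
        }
        return counts;
    }

    // Pass 2: re-key every word by its negated count. The TreeMap sorts keys in
    // ascending order (standing in for the shuffle sort), so -3 comes before -1
    // and the most frequent words are listed first.
    static List<String> sortByFrequency(String text) {
        TreeMap<Integer, List<String>> byNegatedCount = new TreeMap<>();
        for (Map.Entry<String, Integer> e : countWords(text).entrySet()) {
            byNegatedCount
                .computeIfAbsent(-e.getValue(), k -> new ArrayList<>())
                .add(e.getKey());
        }
        List<String> lines = new ArrayList<>();
        for (Map.Entry<Integer, List<String>> e : byNegatedCount.entrySet()) {
            int count = -e.getKey(); // undo the negation before printing
            for (String word : e.getValue()) {
                lines.add(word + "\t" + count);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        // Made-up sample input; "my" occurs three times, every other word once.
        for (String line : sortByFrequency("my name is my Hi my")) {
            System.out.println(line);
        }
    }
}
```

With this sample input the first printed line is `my` with a count of 3; the order among words that share a count is not defined in this sketch.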
