hadoop - wordcount: words common to multiple files

Question

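I have managed to run the Hadoop wordcount example in non-distributed mode. I get the output in a file named "part-00000", and I can see that it lists every word from all of the input files combined.

After tracing through the wordcount code, I can see that it takes each line and splits it into words on whitespace.

I am trying to work out how to list only the words that occur in more than one file, along with their occurrence counts. Can this be achieved in Map/Reduce?

Added: are these changes appropriate?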

    // changes in the parameters here
    public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {

      // These are the original lines; I am not using them but left them here...
      private final static IntWritable one = new IntWritable(1);
      private Text word = new Text();

      // My changes are here too
      private Text outvalue = new Text();
      FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
      private String filename = fileSplit.getPath().getName();

      public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
          word.set(tokenizer.nextToken());

          // And here
          outvalue.set(filename);
          output.collect(word, outvalue);
        }
      }
    }

score 0 · Accepted Answer

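You can amend the mapper to output the word as the key and a Text value holding the name of the file the word came from. Then, in your reducer, you just need to de-duplicate the file names and output only those entries where the word appears in more than one file.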
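The idea can be sketched in plain Java, without any Hadoop classes, to show what the mapper emits and what the reducer keeps. This is only an illustration under assumptions: the class `CommonWordsSketch`, its method names, and the in-memory grouping that stands in for the shuffle are all hypothetical, and the reducer here reports the total number of occurrences for each word that survives the filter.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical plain-Java model of the job; not Hadoop API code.
public class CommonWordsSketch {

    /** Map step: emit one (word, fileName) pair per whitespace-separated token. */
    static List<String[]> map(String fileName, String line) {
        List<String[]> pairs = new ArrayList<>();
        for (String token : line.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                pairs.add(new String[] {token, fileName});
            }
        }
        return pairs;
    }

    /**
     * Reduce step for one word: returns the number of occurrences if the word
     * was seen in more than one distinct file, or -1 if it should be dropped.
     */
    static int reduce(List<String> fileNames) {
        Set<String> distinct = new HashSet<>(fileNames);
        return distinct.size() > 1 ? fileNames.size() : -1;
    }

    /** Stand-in for the shuffle: group pairs by word, then reduce each group. */
    static Map<String, Integer> run(Map<String, List<String>> linesByFile) {
        Map<String, List<String>> grouped = new TreeMap<>();
        for (Map.Entry<String, List<String>> file : linesByFile.entrySet()) {
            for (String line : file.getValue()) {
                for (String[] pair : map(file.getKey(), line)) {
                    grouped.computeIfAbsent(pair[0], k -> new ArrayList<>()).add(pair[1]);
                }
            }
        }
        Map<String, Integer> result = new TreeMap<>();
        for (Map.Entry<String, List<String>> e : grouped.entrySet()) {
            int count = reduce(e.getValue());
            if (count >= 0) {
                result.put(e.getKey(), count);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        Map<String, List<String>> input = new TreeMap<>();
        input.put("a.txt", Arrays.asList("hello world hello"));
        input.put("b.txt", Arrays.asList("hello hadoop"));
        System.out.println(run(input)); // prints {hello=3}
    }
}
```

In a real job the `reduce` logic would live in a Reducer whose input values are the Text file names emitted by the mapper.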

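How you get the name of the file currently being processed depends on whether you are using the old or the new API (the mapred or mapreduce package, respectively). I know that with the new API you can extract the mapper's input split from the Context object using the getInputSplit method (then probably cast the InputSplit to a FileSplit, assuming you are using TextInputFormat). With the old API I have never tried it, but apparently you can use a configuration property called map.input.file.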
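If you go the configuration-property route, the value you read back is likely to be a full path rather than a bare file name. Here is a small sketch of trimming it down, assuming the value looks like a URI or an absolute path without spaces; the class `InputFileNameSketch`, the helper `baseName`, and the sample paths are made up for illustration.

```java
import java.net.URI;

// Hypothetical helper; assumes the property value is a URI-style or absolute path.
public class InputFileNameSketch {

    /** Returns the last path segment, e.g. "file1.txt" for ".../input/file1.txt". */
    static String baseName(String inputFile) {
        // URI.getPath() drops the scheme and authority (e.g. "hdfs://host:port").
        String path = URI.create(inputFile).getPath();
        return path.substring(path.lastIndexOf('/') + 1);
    }

    public static void main(String[] args) {
        // Sample values are illustrative, not taken from a real cluster.
        System.out.println(baseName("hdfs://namenode:9000/user/demo/input/file1.txt"));
        System.out.println(baseName("file:/tmp/input/file2.txt"));
    }
}
```

When a FileSplit is available, calling getPath().getName() on it gives the same result without any string handling.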

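This is also a good candidate for introducing a combiner, to remove multiple occurrences of the same word coming from the same mapper.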
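What such a combiner would do for a single word can be sketched in plain Java. The class `DedupCombinerSketch` and its `combine` method are hypothetical stand-ins, not Hadoop API code: they take the file names one mapper emitted for a word and collapse the repeats.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;

// Hypothetical plain-Java model of a de-duplicating combiner.
public class DedupCombinerSketch {

    /** Collapses repeated file names for one word, keeping first-seen order. */
    static List<String> combine(List<String> fileNames) {
        return new ArrayList<>(new LinkedHashSet<>(fileNames));
    }

    public static void main(String[] args) {
        // One mapper saw the same word three times in a.txt.
        System.out.println(combine(Arrays.asList("a.txt", "a.txt", "a.txt"))); // prints [a.txt]
    }
}
```

Note that de-duplicating this way throws away how often the word occurred in each file. If the occurrence counts are also wanted in the output, the value would have to carry a per-file count instead of just the file name.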

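Update

So, in response to your question: you are trying to use an instance variable called reporter, which does not exist in the class scope of the mapper. Amend as follows: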

public static class Map extends MapReduceBase implements Mapper<Text, Text, Text, Text> {
  // These are the original line; I am not using them but left them here...
  private final static IntWritable one = new IntWritable(1);
  private Text word = new Text();

  //My changes are here too
  private Text outvalue=new Text();
  private String filename = null;

  public void map(Text key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
    if (filename == null) {
      filename = ((FileSplit) reporter.getInputSplit()).getPath().getName();
    }

    String line = value.toString();
    StringTokenizer tokenizer = new StringTokenizer(line);
    while (tokenizer.hasMoreTokens()) {
      word.set(tokenizer.nextToken());

      //    And here        
      outvalue.set(filename);
      output.collect(word, outvalue);
    }
  }
}

(Really not sure why SO isn't respecting the formatting above...)
