sorting - Hadoop MapReduce: return a sorted list of the words in a text file

Question

My task is to return an alphabetically sorted list of all the words contained in a text file, while preserving duplicates.

{to be or not to be} → {be be not or to to}

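My idea is to emit each word as both key and value. Since Hadoop sorts by key, the words are then sorted alphabetically for free. In the Reduce phase I simply append all the words that share a key (which are just the same word repeated) to a single Text value.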

   import java.io.IOException;
   import java.util.StringTokenizer;

   import org.apache.hadoop.io.LongWritable;
   import org.apache.hadoop.io.Text;
   import org.apache.hadoop.mapreduce.Mapper;
   import org.apache.hadoop.mapreduce.Reducer;

   public class WordSort {

     public static class Map extends Mapper<LongWritable, Text, Text, Text> {

       private Text word = new Text();

       @Override
       public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException {
         String line = value.toString();
         StringTokenizer tokenizer = new StringTokenizer(line);
         while (tokenizer.hasMoreTokens()) {
           word.set(tokenizer.nextToken());
           // transform to lower case
           String lower = word.toString().toLowerCase();
           // emit the word as both key and value
           context.write(new Text(lower), new Text(lower));
         }
       }
     }

     public static class Reduce extends Reducer<Text, Text, Text, Text> {

       @Override
       public void reduce(Text key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
         // concatenate every occurrence of the word, separated by spaces
         String result = "";
         for (Text value : values) {
           result += value.toString() + " ";
         }
         context.write(key, new Text(result));
       }
     }
   }

My problem is: how do I write only the values to the output file? Currently I get this:

be be be 
not not 
or or
to to to

So each line contains the key followed by the values, but I want to write only the values, to get this:

be be
not 
or 
to to

Is this even possible, or do I have to drop one entry from the values of each word?

score 0 · Accepted Answer

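Disclaimer: I'm not a Hadoop user, but I have done a lot of Map/Reduce with CouchDB.

If you only need the keys, why not emit an empty value?

Also, it sounds like you don't want to reduce them at all, since you want one key for each occurrence.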
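
To make the suggestion concrete, here is a minimal plain-Java sketch of the key-only idea. It does not use the Hadoop API: the `TreeMap` only stands in for Hadoop's sort-by-key, and the names `KeyOnlySketch`, `mapAndSort` and `reduce` are made up for illustration. One caveat, as far as I understand Hadoop: a job configured with zero reducers skips the shuffle and sort, so an identity-style reducer that writes each key back out would probably still be needed to get sorted output.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation only: no Hadoop classes are used here.
class KeyOnlySketch {

    // "Map" step: each lower-cased word becomes a key with no value.
    // The TreeMap stands in for the sort-by-key and counts occurrences per key.
    static Map<String, Integer> mapAndSort(String text) {
        Map<String, Integer> sorted = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            sorted.merge(tokenizer.nextToken().toLowerCase(), 1, Integer::sum);
        }
        return sorted;
    }

    // Identity-style "reduce" step: write the key once per occurrence, nothing else.
    static List<String> reduce(Map<String, Integer> sorted) {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, Integer> entry : sorted.entrySet()) {
            for (int i = 0; i < entry.getValue(); i++) {
                lines.add(entry.getKey());
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        reduce(mapAndSort("To be or not to be")).forEach(System.out::println);
    }
}
```

This yields one word per line, in sorted order with duplicates kept, rather than the grouped lines shown in the question.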

score 0 · Accepted Answer

I just tried this with the MaxTemperature example from Hadoop - The Definitive Guide, and the code below works:

context.write(null, new Text(result));
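
To show why a null key gives the desired lines, here is a small plain-Java simulation. It does not call Hadoop: `formatLine` is a hypothetical stand-in for how a tab-separated text output writer is understood to behave (key and separator are skipped when the key is null), and `ValueOnlySketch`, `mapAndGroup` and `reduce` are illustrative names. In a real job, `NullWritable.get()` from `org.apache.hadoop.io` is a commonly used alternative to a literal `null` key, though I have not tested that here. Unlike the question's reducer, this sketch joins the values with `String.join`, so there is no trailing space.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

// Plain-Java simulation only: no Hadoop classes are used here.
class ValueOnlySketch {

    // Mimics a tab-separated output line; a null key drops the key and the separator.
    static String formatLine(String key, String value) {
        return key == null ? value : key + "\t" + value;
    }

    // Groups each lower-cased word under itself as key; the TreeMap keeps keys sorted.
    static Map<String, List<String>> mapAndGroup(String text) {
        Map<String, List<String>> grouped = new TreeMap<>();
        StringTokenizer tokenizer = new StringTokenizer(text);
        while (tokenizer.hasMoreTokens()) {
            String lower = tokenizer.nextToken().toLowerCase();
            grouped.computeIfAbsent(lower, k -> new ArrayList<>()).add(lower);
        }
        return grouped;
    }

    // "Reduce" step: join the values of each key and write them with a null key.
    static List<String> reduce(Map<String, List<String>> grouped) {
        List<String> lines = new ArrayList<>();
        for (List<String> values : grouped.values()) {
            lines.add(formatLine(null, String.join(" ", values)));
        }
        return lines;
    }

    public static void main(String[] args) {
        reduce(mapAndGroup("To be or not to be")).forEach(System.out::println);
    }
}
```

Passing the key instead of `null` to `formatLine` reproduces the key-plus-values lines from the question.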
