hadoop - How to read and process a file as key-value pairs in Hadoop

Question

I am trying to read the following data as key-value pairs in Hadoop.

name: "Clooney, George", release: "2013", movie: "Gravity",
name: "Pitt, Brad", release: "2004", movie: "Ocean's 12",
name: "Clooney, George", release: "2004", movie: "Ocean's 12",
name: "Pitt, Brad", release: "1999", movie: "Fight Club"

I need the following output:

name: "Clooney, George", movie: "Gravity, Ocean's 12",
name: "Pitt, Brad", movie: "Ocean's 12, Fight Club",

I wrote a Mapper and a Reducer as follows:

  public static class MyMapper
       extends Mapper<Text, Text, Text, Text>{

    private Text word = new Text();

    public void map(Text key, Text value, Context context
                    ) throws IOException, InterruptedException {
      StringTokenizer itr = new StringTokenizer(value.toString(),",");
      while (itr.hasMoreTokens()) {
        word.set(itr.nextToken());
        context.write(key, word);
      }
    }
  }

  public static class MyReducer
       extends Reducer<Text,Text,Text,Text> {
    private Text result = new Text();

    public void reduce(Text key, Iterable<Text> values,
                       Context context
                       ) throws IOException, InterruptedException {
      String actors = "";
      for (Text val : values) {
         actors += val.toString();
      }
      result.set(actors);
      context.write(key, result);
    }
  }

I also added the following configuration:

Configuration conf = new Configuration();
conf.set("mapreduce.input.keyvaluelinerecordreader.key.value.separator", ",");

I get the following output:

name: "Clooney   George" release: "2013" movie: "Gravity" George" release: "2004" movie: "Ocean's 12"
name: "Pitt  Brad" release: "2004" movie: "Ocean's 12" Brad" release: "1999" movie: "Fight Club"

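It seems I cannot even read the basic key-value pairs correctly. How does key-value pair processing work in Hadoop? Could someone elaborate on this and point out where I am going wrong?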

Thanks in advance.

score 1 · Accepted Answer

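Your problem is that KeyValueTextInputFormat does not respect the quotes in your input records. It simply looks for the first occurrence of the separator you defined (a comma), takes everything before that character as the key, and takes everything after it as the value.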

So for the first record, your mapper receives the following input key/value:

Key: name: "Clooney
Value: George", release: "2013", movie: "Gravity",
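
To make the split concrete, here is a small plain-Java illustration of the "cut at the first separator" behaviour described above. It is not Hadoop's source code; the class name FirstSeparatorSplit and its split method are made up for this sketch, and it only imitates the observable behaviour on one sample line.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.Map;

// Illustration only: imitates splitting a line at the FIRST separator,
// ignoring quotes. Hypothetical helper, not part of Hadoop.
class FirstSeparatorSplit {

    // Returns (key, value): the text before and after the first separator.
    // If the separator is absent, the whole line is the key and the value is empty.
    static Map.Entry<String, String> split(String line, char separator) {
        int pos = line.indexOf(separator);
        if (pos < 0) {
            return new SimpleEntry<>(line, "");
        }
        return new SimpleEntry<>(line.substring(0, pos), line.substring(pos + 1));
    }

    public static void main(String[] args) {
        String line = "name: \"Clooney, George\", release: \"2013\", movie: \"Gravity\",";
        Map.Entry<String, String> kv = split(line, ',');
        System.out.println("Key:   " + kv.getKey());
        System.out.println("Value: " + kv.getValue());
    }
}
```

Because the comma inside "Clooney, George" is the first comma on the line, the key ends in the middle of the quoted name. The mapper in the question then tokenizes the remaining value on commas again, which is why the fragments in the output look scrambled.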

To fix this, I think you should switch back to plain TextInputFormat and move the extraction logic into the mapper's map method.
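
A rough sketch of what that extraction logic could look like, written as plain Java so it runs without a cluster. With TextInputFormat the mapper receives a LongWritable byte offset as the key and the whole line as a Text value, so the map method would call something like field(value.toString(), "name") and emit (name, movie); the reducer would then join the movies with ", ". The class MovieRecordParser, its methods, and the regular expression are assumptions made for this example, not code from the question.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: quote-aware field extraction plus a local
// stand-in for the shuffle/reduce step.
class MovieRecordParser {

    // Returns the quoted value of `field` in `line`, or null if absent.
    // Example: field(line, "name") on name: "Pitt, Brad", ... gives Pitt, Brad
    static String field(String line, String field) {
        Pattern p = Pattern.compile("\\b" + Pattern.quote(field) + ":\\s*\"([^\"]*)\"");
        Matcher m = p.matcher(line);
        return m.find() ? m.group(1) : null;
    }

    // Stand-in for map + shuffle: groups movie titles by actor name.
    static Map<String, List<String>> groupMovies(List<String> lines) {
        Map<String, List<String>> byActor = new LinkedHashMap<>();
        for (String line : lines) {
            String name = field(line, "name");
            String movie = field(line, "movie");
            if (name == null || movie == null) {
                continue; // skip malformed records
            }
            byActor.computeIfAbsent(name, k -> new ArrayList<>()).add(movie);
        }
        return byActor;
    }

    // Stand-in for reduce: joins the titles with ", " in the requested layout.
    static String format(String name, List<String> movies) {
        return "name: \"" + name + "\", movie: \"" + String.join(", ", movies) + "\",";
    }

    public static void main(String[] args) {
        List<String> lines = Arrays.asList(
            "name: \"Clooney, George\", release: \"2013\", movie: \"Gravity\",",
            "name: \"Pitt, Brad\", release: \"2004\", movie: \"Ocean's 12\",",
            "name: \"Clooney, George\", release: \"2004\", movie: \"Ocean's 12\",",
            "name: \"Pitt, Brad\", release: \"1999\", movie: \"Fight Club\"");
        for (Map.Entry<String, List<String>> e : groupMovies(lines).entrySet()) {
            System.out.println(format(e.getKey(), e.getValue()));
        }
    }
}
```

In a real job the order in which a reducer sees the values for a key is not guaranteed, so the movie titles may come out in a different order than in this local run. Note also that the reducer in the question concatenates values without any separator, so a ", " join as in format would be needed there as well.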
