hadoop - 如何在映射器输出中收集三个参数。有什么办法

Question

我是地图Reduce和hadoop概念的新手。所以请帮忙

我有近 100 个包含这种格式数据的文件

conf/iceis/GochenouerT01a:::John E. Gochenouer::Michael L. Tyler:::Voyeurism, Exhibitionism, and Privacy on the Internet.

我应该通过 map reduce 算法来做。现在在我想显示的输出中

John E. Gochenoue Voyeurism .
John E. Gochenoue Exhibitionism 
John E. Gochenoue and 
John E. Gochenoue privacy
John E. Gochenoue on
John E. Gochenoue the
John E. Gochenoue internet   
Michael L. Tyler   Voyeurism .
Michael L. Tyler   Exhibitionism 
Michael L. Tyler   and 
Michael L. Tyler   privacy
Michael L. Tyler   on
Michael L. Tyler   the
Michael L. Tyler   internet

所以现在它是单行。所以有'n'行这样的行包含大量的名字和大量的书籍。

因此，如果我考虑一份 110 行的文档。我可以像这样输出我的映射器吗

John E. Gochenoue Voyeurism    1  
John E. Gochenoue Exhibitionism 3 
Michael L. Tyler   on           7

IE 要说它显示名称和工作，然后显示文档中单词的出现次数，最后在减少后它应该显示名称，然后显示名称对它的单词以及它在 ' 中出现的单词的组合频率n'文件。

我知道 output.collector() 但它需要两个参数

output.collect(arg0, arg1)

有什么方法可以收集三个值，如名称，单词和单词的出现

以下是我的代码

public static class Map extends MapReduceBase implements
        Mapper<LongWritable, Text, Text, IntWritable> {
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    public void map(LongWritable key, Text value,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String line = value.toString();
        /*
         * StringTokenizer tokenizer = new StringTokenizer(line); while
         * (tokenizer.hasMoreTokens()) { word.set(tokenizer.nextToken());
         * output.collect(word, one);
         */

        String strToSplit[] = line.split(":::");
        String end = strToSplit[strToSplit.length - 1];
        String[] names = strToSplit[1].split("::");
        for (String name : names) {
            StringTokenizer tokens = new StringTokenizer(end, " ");
            while (tokens.hasMoreElements()) {
                output.collect(arg0, arg1)
                System.out.println(tokens.nextElement());
            }
        }

    }
}

public static class Reduce extends MapReduceBase implements
        Reducer<Text, IntWritable, Text, IntWritable> {
    public void reduce(Text key, Iterator<IntWritable> values,
            OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        int sum = 0;
        while (values.hasNext()) {
            sum += values.next().get();
        }
        output.collect(key, new IntWritable(sum));
    }
}

public static void main(String[] args) throws Exception {
    JobConf conf = new JobConf(example.class);
    conf.setJobName("wordcount");

    conf.setOutputKeyClass(Text.class);
    conf.setOutputValueClass(IntWritable.class);

    conf.setMapperClass(Map.class);
    conf.setCombinerClass(Reduce.class);
    conf.setReducerClass(Reduce.class);

    conf.setInputFormat(TextInputFormat.class);
    conf.setOutputFormat(TextOutputFormat.class);

    FileInputFormat.setInputPaths(conf, "/home/vishal/workspace/hw3data");
    FileOutputFormat.setOutputPath(conf,
            new Path("/home/vishal/nmnmnmnmnm"));

    JobClient.runJob(conf);
}

score 2 · Accepted Answer

诀窍是编写一个文本（hadoop 可写实现之一），其字符串内容是许多制表符分隔的值。这就是您可以轻松地在映射器和化简器之间传递复杂值的方式。

当然，更有工业实力的做法是自己写Writable。可写对象基本上是具有特殊序列化/反序列化行为的 pojo。在这种情况下，您的可写对象将具有三个属性。

score 0 · Accepted Answer

对于传递标记化字符串时的映射器类，当您想要计数时，您基本上需要对相同的键进行分组。

这意味着计算一个人使用一个词的次数将需要您生成一个看起来像这样的密钥John Smith<delimiter>Word。分隔符可以是任何你想要的。大多数人使用制表符将其保留在最终减速器输出中的 TSV。

因此，要更正您的 output.collect 语句，请将其更改为如下所示：

output.collect(new Text(name + "\t" + tokens.nextElement()), new IntWritable(1));

hadoop - 如何在映射器输出中收集三个参数。有什么办法

2 回答 2

Related

Reference