hadoop - Separate output files in Hadoop MapReduce

Question

My question may have been asked before, but I could not find a clear answer to it.

My MapReduce job is a basic WordCount. My current output file is:

// filename : 'part-r-00000'
789  a
755  #c   
456  d
123  #b

How can I change the output file name?

Also, is it possible to get 2 output files:

// First output file
789  a
456  d

// Second output file
123  #b
755  #c

Here is my Reducer class:

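public static class SortReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    // reduce() receives every value for a key as an Iterable; write each one out.
    @Override
    public void reduce(IntWritable key, Iterable<Text> values, Context context) throws IOException, InterruptedException {
        for (Text value : values) {
            context.write(key, value);
        }
    }
}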

Here is my Partitioner class:

public class TweetPartitionner extends Partitioner<Text, IntWritable>{

    @Override
    public int getPartition(Text a_key, IntWritable a_value, int a_nbPartitions) {
        if(a_key.toString().startsWith("#"))
            return 1;
        return 0;
    }


}

Thanks a lot!

score 1 · Accepted Answer

For your other question, about how to change the output file name, take a look at http://hadoop.apache.org/docs/stable/api/org/apache/hadoop/mapreduce/lib/output/MultipleOutputs.html#write(java.lang.String, K, V).
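As a rough illustration of that method, here is a minimal sketch of a reducer that writes through MultipleOutputs. The class name SplitReducer and the named outputs "hashtags" and "words" are assumptions made up for this example, and it assumes the driver registers both names with MultipleOutputs.addNamedOutput before the job is submitted.

```java
import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.output.MultipleOutputs;

// Sketch: sends hashtags and plain words to differently named output files.
// SplitReducer, "hashtags" and "words" are hypothetical names for illustration.
public class SplitReducer extends Reducer<IntWritable, Text, IntWritable, Text> {

    private MultipleOutputs<IntWritable, Text> outputs;

    @Override
    protected void setup(Context context) {
        outputs = new MultipleOutputs<>(context);
    }

    @Override
    protected void reduce(IntWritable key, Iterable<Text> values, Context context)
            throws IOException, InterruptedException {
        for (Text value : values) {
            // Each name must have been registered in the driver, for example:
            // MultipleOutputs.addNamedOutput(job, "hashtags",
            //         TextOutputFormat.class, IntWritable.class, Text.class);
            String name = value.toString().startsWith("#") ? "hashtags" : "words";
            outputs.write(name, key, value);
        }
    }

    @Override
    protected void cleanup(Context context) throws IOException, InterruptedException {
        // Closes every named output so buffered records reach the files.
        outputs.close();
    }
}
```

With named outputs, the files should be named after the output rather than "part", for example hashtags-r-00000 and words-r-00000. Named output names may only contain letters and digits.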

score 0 · Accepted Answer

In your job file, set

job.setNumReduceTasks(2);

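Emit from the mapper with the word as the key and the count as the value (the types your partitioner expects).

Write a partitioner and add it to the job configuration. In the partitioner, check whether the key starts with #: return 1 if it does, otherwise 0.

In the reducer, swap the key and the value.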
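The routing described above can be sketched without any Hadoop dependency. The following is only a plain-Java simulation, not Hadoop code: PartitionSketch, getPartition and route are hypothetical names, and the modulo guard is an addition that keeps the partition number in range when only one partition exists.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hadoop-free sketch: mimics how a partitioner routes records to reducers.
class PartitionSketch {

    // Same rule as the question's partitioner: hashtags go to partition 1.
    // The modulo keeps the result in range when only one partition exists.
    static int getPartition(String key, int numPartitions) {
        return key.startsWith("#") ? 1 % numPartitions : 0;
    }

    // Routes (word, count) pairs to partitions and swaps each pair to
    // "count<TAB>word", as the reducer would when writing its output.
    static List<List<String>> route(Map<String, Integer> counts, int numPartitions) {
        List<List<String>> outputs = new ArrayList<>();
        for (int i = 0; i < numPartitions; i++) {
            outputs.add(new ArrayList<>());
        }
        for (Map.Entry<String, Integer> e : counts.entrySet()) {
            int p = getPartition(e.getKey(), numPartitions);
            outputs.get(p).add(e.getValue() + "\t" + e.getKey());
        }
        return outputs;
    }

    public static void main(String[] args) {
        Map<String, Integer> counts = new LinkedHashMap<>();
        counts.put("a", 789);
        counts.put("#c", 755);
        counts.put("d", 456);
        counts.put("#b", 123);

        List<List<String>> outputs = route(counts, 2);
        for (int i = 0; i < outputs.size(); i++) {
            // Reducer i writes a file named like part-r-00000, part-r-00001, ...
            System.out.println(String.format("part-r-%05d", i) + " -> " + outputs.get(i));
        }
    }
}
```

In the real job, each of the two reducers writes its own file, so plain words end up in part-r-00000 and hashtags in part-r-00001. Unlike this simulation, Hadoop also sorts the records by key within each partition.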
