I have managed to run the Hadoop wordcount example in non-distributed mode; I get the output in a file named "part-00000", and I can see that it lists all the words from all the combined input files.
After tracing the wordcount code, I can see that it takes lines and splits them into words on whitespace.
I am trying to figure out how to list only the words that occur in multiple files, along with their occurrence counts. Can this be done in Map/Reduce? -Added- Are these changes appropriate?
// Changes in the parameters here
public static class Map extends MapReduceBase implements Mapper<LongWritable, Text, Text, Text> {

    // These are the original lines; I am not using them but left them here...
    private final static IntWritable one = new IntWritable(1);
    private Text word = new Text();

    // My changes are here too
    private Text outvalue = new Text();

    public void map(LongWritable key, Text value, OutputCollector<Text, Text> output, Reporter reporter) throws IOException {
        // The input split is only reachable through the Reporter passed to map(),
        // so the file name has to be looked up here, not in a field initializer
        FileSplit fileSplit = (FileSplit) reporter.getInputSplit();
        String filename = fileSplit.getPath().getName();

        String line = value.toString();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            word.set(tokenizer.nextToken());
            // And here: emit the file name as the value instead of the count 1
            outvalue.set(filename);
            output.collect(word, outvalue);
        }
    }
}
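To answer my own "can this be done?" question for the reduce side: since the mapper emits (word, filename) pairs, a reducer would receive all the file names a word appeared in, and could count the distinct ones and only emit words seen in more than one file. The core logic, sketched as plain Java without the Hadoop types (the class and method names here, `DistinctFileCounter` and `countDistinctFiles`, are hypothetical, not part of any Hadoop API):

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper mirroring what the reducer would do with the
// (word, filename) values collected from the mapper above.
public class DistinctFileCounter {

    // Count how many distinct files a word appeared in, by
    // deduplicating the emitted file names with a HashSet.
    public static int countDistinctFiles(List<String> filenames) {
        Set<String> distinct = new HashSet<>(filenames);
        return distinct.size();
    }

    public static void main(String[] args) {
        // One word emitted twice from a.txt and once from b.txt
        List<String> values = Arrays.asList("a.txt", "b.txt", "a.txt");
        int files = countDistinctFiles(values);
        // In the real reducer, the word would only be collected
        // when this condition holds
        if (files > 1) {
            System.out.println("word occurs in " + files + " files");
        }
    }
}
```

Inside a real old-API `Reducer<Text, Text, Text, IntWritable>`, the same idea would mean draining the `Iterator<Text>` of values into a `Set<String>` and calling `output.collect(key, new IntWritable(set.size()))` only when the set has more than one entry.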