apache - Understanding the Hadoop WordCount example

Question

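The example code is here: http://wiki.apache.org/hadoop/WordCount

I understand the logic, but I noticed that the main function only specifies the input and output paths. It never specifies what the key and the value are.

How do the map and reduce functions figure this out?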

public void map(LongWritable key, Text value, Context context) throws IOException, InterruptedException 

public void reduce(Text key, Iterable<IntWritable> values, Context context)

score 2 · Accepted Answer

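Since you already understand that Mapper and Reducer take the form Key1, Value1, Key2, Value2, where Key1 and Value1 are the input key-value types and Key2 and Value2 are the output types, I will explain the rest.

In the main function you will see this line: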

job.setInputFormatClass(TextInputFormat.class);

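This determines how the input file is read. If you look at the source of TextInputFormat, you will see (at line 41) that it uses LineRecordReader to break the file into key-value pairs. The key is the byte offset of the line within the file, and the value is the line itself.

But, as you said, this is not done automatically. You can control this behavior by writing your own custom input format and record reader classes.

Hope this clears up your doubt.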
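To make the (offset, line) pairing concrete, here is a minimal plain-Java sketch that imitates what the map function receives for each line. It is only a simulation under simplifying assumptions (UTF-8 text, "\n" line endings, a single unsplit file); the class name OffsetLineDemo and the method toRecords are made up for illustration and are not part of Hadoop or of LineRecordReader.

```java
import java.nio.charset.StandardCharsets;
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Hypothetical simulation, NOT Hadoop's LineRecordReader.
class OffsetLineDemo {

    // Turns text into (byte offset of line start, line content) pairs,
    // mirroring the LongWritable key / Text value that map() receives.
    static List<Map.Entry<Long, String>> toRecords(String text) {
        List<Map.Entry<Long, String>> records = new ArrayList<>();
        long offset = 0;
        int start = 0;
        while (start < text.length()) {
            int end = text.indexOf('\n', start);
            if (end < 0) {
                end = text.length(); // last line without a trailing newline
            }
            String line = text.substring(start, end);
            records.add(new AbstractMap.SimpleEntry<>(offset, line));
            // Advance by the line's byte length plus one byte for '\n'.
            offset += line.getBytes(StandardCharsets.UTF_8).length + 1;
            start = end + 1;
        }
        return records;
    }

    public static void main(String[] args) {
        for (Map.Entry<Long, String> r : toRecords("hello world\nhello hadoop\n")) {
            System.out.println(r.getKey() + "\t" + r.getValue());
        }
        // prints:
        // 0	hello world
        // 12	hello hadoop
    }
}
```

WordCount ignores the offset key entirely and only tokenizes the value, which is why the key type never shows up in the counting logic.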

score 0 · Accepted Answer

The declarations of the Mapper and Reducer classes enforce the types of the map and reduce functions:

public static class Map extends Mapper<LongWritable, Text, Text, IntWritable> {
    ...
}

and

public static class Reduce extends Reducer<Text, IntWritable, Text, IntWritable> {
    ...
}

Both take the form K1, V1, K2, V2, where K1, V1 are the input key-value types and K2, V2 are the output key-value types.
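As a rough illustration of how the generic parameters pin down the method signature, the sketch below uses a simplified stand-in for Hadoop's Mapper. SimpleMapper, TokenMapper, emit and output are hypothetical names invented for this example, and plain Long, String and Integer replace LongWritable, Text and IntWritable so that it runs without Hadoop on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for Hadoop's Mapper<K1, V1, K2, V2>; not the real class.
abstract class SimpleMapper<K1, V1, K2, V2> {
    // Collected output pairs, standing in for Context.write(key, value).
    final List<String> output = new ArrayList<>();

    protected void emit(K2 key, V2 value) {
        output.add(key + "=" + value);
    }

    // Subclasses must accept exactly (K1, V1); the compiler checks this.
    abstract void map(K1 key, V1 value);
}

// K1 = Long (line offset), V1 = String (line), K2 = String (word), V2 = Integer (count).
class TokenMapper extends SimpleMapper<Long, String, String, Integer> {
    @Override
    void map(Long key, String value) {
        for (String word : value.trim().split("\\s+")) {
            if (!word.isEmpty()) {
                emit(word, 1); // emit(1, word) would not compile
            }
        }
    }
}

class GenericsDemo {
    public static void main(String[] args) {
        TokenMapper mapper = new TokenMapper();
        mapper.map(0L, "hello world hello");
        System.out.println(mapper.output); // prints [hello=1, world=1, hello=1]
    }
}
```

Declaring map with any other parameter types under @Override is a compile error, which is the sense in which the class declaration, rather than main, fixes the key and value types.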
