hadoop - MapReduce job distribution among reducers

Question

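I developed a small MapReduce program. When I opened the process log, I saw that the framework had created one map task and two reducers. I had only one input file and got two output files. Now please tell me: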

1) Is the number of mappers and reducers decided by the framework, or can it be changed?
2) Is the number of output files always equal to the number of reducers, i.e. does each
   reducer create its own output file?
3) How is one input file distributed among mappers? And is the output of one mapper
   distributed among multiple reducers (is this done by the framework, or can you change it)?
4) How do I handle multiple input files, i.e. a directory
   containing input files?

Please answer these questions. I am a beginner with MapReduce.

score 4 · Accepted Answer

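Let me try to answer your questions. Please tell me wherever you think I am wrong.

1) Is the number of mappers and reducers decided by the framework, or can it be changed?

The total number of map tasks depends on the total number of logical splits derived from the HDFS blocks. So fixing the number of map tasks is not always possible, because different files have different sizes and therefore different numbers of blocks. If you are using TextInputFormat, each logical split is roughly equal to one block, so the total number of map tasks cannot be pinned down in advance: each file can produce a different number of blocks.

Unlike the number of mappers, the number of reducers can be fixed.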
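As a rough illustration of why the map count follows the input rather than a setting, here is a minimal sketch in plain Java with no Hadoop dependency. It assumes one split per block and a block size of 128 MB; the class name `SplitEstimate` and the file sizes are made up for the example, and the real split calculation has more refinements (minimum and maximum split size, for instance).

```java
final class SplitEstimate {
    // Assumed block size for this example: 128 MB.
    static final long BLOCK_SIZE = 128L * 1024 * 1024;

    // Rough split count for one non-empty file: one split per block, rounded up.
    static long splitsForFile(long fileSizeBytes) {
        return (fileSizeBytes + BLOCK_SIZE - 1) / BLOCK_SIZE;
    }

    // Each file is split on its own, so the total is the sum over all files.
    static long splitsForFiles(long[] fileSizesBytes) {
        long total = 0;
        for (long size : fileSizesBytes) {
            total += splitsForFile(size);
        }
        return total;
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024;
        long[] sizes = {10 * mb, 300 * mb, 1024 * mb}; // hypothetical input files
        // 1 + 3 + 8 splits, so roughly 12 map tasks under these assumptions
        System.out.println(splitsForFiles(sizes)); // prints 12
    }
}
```

Change the file sizes and the estimated map count changes with them, which is the point made above.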

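2) Is the number of output files always equal to the number of reducers, i.e. does each reducer create its own output file?

To some extent yes, but there are ways to create more than one output file from a reducer, for example MultipleOutputs.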
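For the default case (one file per reducer), the output directory ends up with one part file per reduce task. The sketch below only builds the expected file names. It assumes the `part-r-NNNNN` naming used by the new-API file output formats (the exact names can vary with the Hadoop version, API, and output format), and the class name `OutputNames` is just for illustration.

```java
import java.util.ArrayList;
import java.util.List;

final class OutputNames {
    // One output file name per reduce task, numbered from 0.
    static List<String> reducerOutputFiles(int numReduceTasks) {
        List<String> names = new ArrayList<>();
        for (int i = 0; i < numReduceTasks; i++) {
            names.add(String.format("part-r-%05d", i));
        }
        return names;
    }

    public static void main(String[] args) {
        // Two reducers, as in the question, give two output files.
        System.out.println(reducerOutputFiles(2)); // prints [part-r-00000, part-r-00001]
    }
}
```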

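3) How is one input file distributed among mappers? And is the output of one mapper distributed among multiple reducers (is this done by the framework, or can you change it)?

Each file in HDFS is made up of blocks. These blocks are replicated and can reside on multiple nodes (machines). Map tasks are then scheduled to run on these blocks. How many map tasks can run concurrently depends on the number of processors each machine has. For example, if 10,000 map tasks are scheduled for a file, then depending on the total number of processors across the cluster, perhaps only 100 can run at the same time.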
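To put numbers on that example: if 10,000 map tasks are queued and only 100 can run at once, they get through in about 100 rounds. A tiny sketch of the arithmetic, assuming every task takes about the same time (the class and method names are hypothetical):

```java
final class MapWaves {
    // Number of rounds needed when only `concurrentTasks` can run at a time.
    static long waves(long totalTasks, long concurrentTasks) {
        if (concurrentTasks <= 0) {
            throw new IllegalArgumentException("concurrentTasks must be positive");
        }
        return (totalTasks + concurrentTasks - 1) / concurrentTasks; // ceiling division
    }

    public static void main(String[] args) {
        System.out.println(waves(10_000, 100)); // prints 100
    }
}
```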

By default Hadoop uses HashPartitioner, which computes the hash code of each key the Mapper emits to the framework and converts it into a partition number.

For example:

  public int getPartition(K2 key, V2 value,
                          int numReduceTasks) {
    return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
  }

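As you can see above, the partition is chosen from the fixed total number of reducers, based on the key's hash code. So if numReduceTasks = 4, the returned value will be between 0 and 3.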
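The same arithmetic can be tried outside Hadoop. The sketch below copies the expression from `getPartition` into a plain Java method; the class name `PartitionDemo` and the sample keys are made up, and it uses Java's `String.hashCode()`, so the partition numbers for string keys will not necessarily match what Hadoop's `Text` keys would produce. It also shows why the hash is masked with `Integer.MAX_VALUE`: without the mask, a negative hash code would give a negative partition index.

```java
final class PartitionDemo {
    // Same expression as in getPartition above, applied to any key object.
    static int partitionFor(Object key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        int numReduceTasks = 4;
        String[] keys = {"How", "Are", "You", "How"};
        for (String k : keys) {
            // Equal keys always land on the same reducer.
            System.out.println(k + " -> reducer " + partitionFor(k, numReduceTasks));
        }

        // An Integer's hash code is its own value, so this key hashes to -7.
        Integer negativeKey = -7;
        System.out.println(negativeKey.hashCode() % numReduceTasks); // prints -3 (not a valid index)
        System.out.println(partitionFor(negativeKey, numReduceTasks)); // prints 1
    }
}
```

Because every occurrence of a key maps to the same partition number, all values for that key reach the same reducer, no matter which mapper emitted them.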

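4) How do I handle multiple input files, i.e. a directory containing input files?

Hadoop accepts a directory containing multiple files as the input to a job.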

score 0 · Accepted Answer

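As "SSaikia_JtheRocker" explained, mapper tasks are created according to the total number of logical splits over the HDFS blocks. I would like to add something to question #3, "How is one input file distributed among mappers? And is the output of one mapper distributed among multiple reducers (is this done by the framework, or can you change it)?" For example, consider my word count program, which counts the words in a file, as shown below: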

#

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class WCMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    @Override
    public void map(LongWritable key, Text value, Context context) // context collects the map output
            throws IOException, InterruptedException {

        // value = "How Are You"
        String line = value.toString(); // convert Hadoop's Text to a Java String

        StringTokenizer tokenizer = new StringTokenizer(line); // iterates over the tokens "How", "Are", "You"

        while (tokenizer.hasMoreTokens()) // true while there are tokens left
        {
            value.set(tokenizer.nextToken()); // reuse value to hold the current word, e.g. "How"
            context.write(value, new IntWritable(1)); // emit the pair (word, 1) to the framework
            // How, 1
            // Are, 1
            // You, 1
            // map() itself is called once per input line
        }
    }

}

#
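To see the data flow without a cluster, the mapper above can be imitated in plain Java. This is only a sketch under assumptions: the class `WordCountSim` and its methods are invented for illustration, the "shuffle" is a simple in-memory grouping by key, and nothing here uses Hadoop's API. It shows `map` running once per input line and emitting one (word, 1) pair per token, after which the ones are summed per key the way a reducer would.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.StringTokenizer;
import java.util.TreeMap;

final class WordCountSim {
    // Imitates one map() call: one line in, one (word, 1) pair out per token.
    static List<Map.Entry<String, Integer>> map(String line) {
        List<Map.Entry<String, Integer>> out = new ArrayList<>();
        StringTokenizer tokenizer = new StringTokenizer(line);
        while (tokenizer.hasMoreTokens()) {
            out.add(new AbstractMap.SimpleEntry<>(tokenizer.nextToken(), 1));
        }
        return out;
    }

    // Imitates shuffle + reduce: group the pairs by word and sum the ones.
    static Map<String, Integer> run(List<String> lines) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String line : lines) { // one map() call per line
            for (Map.Entry<String, Integer> pair : map(line)) {
                counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        List<String> lines = new ArrayList<>();
        lines.add("How Are You");
        lines.add("How Do You Do");
        System.out.println(map(lines.get(0)).size()); // prints 3
        System.out.println(run(lines)); // prints {Are=1, Do=2, How=2, You=2}
    }
}
```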

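So in the above program, the line "How Are You" is split into 3 words by StringTokenizer, and the while loop runs once per word, so this single map() call emits 3 key/value pairs. Note that this does not start 3 mappers: map() is called once per input line, and the number of map tasks is decided by the input splits.

As for the reducers, we can specify how many of them should produce our output with the statement 'job.setNumReduceTasks(5);'. The code snippet below will give you an idea.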

#

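import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.util.GenericOptionsParser;

public class BooksMain {

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Use programArgs array to retrieve program arguments.
        String[] programArgs = new GenericOptionsParser(conf, args)
                .getRemainingArgs();
        // Job.getInstance replaces the deprecated "new Job(conf)" constructor.
        Job job = Job.getInstance(conf);
        job.setJarByClass(BooksMain.class);
        job.setMapperClass(BookMapper.class);
        job.setReducerClass(BookReducer.class);
        // Run 5 reduce tasks, which gives 5 output files by default.
        job.setNumReduceTasks(5);

        // job.setCombinerClass(BookReducer.class);

        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);

        // TODO: Update the input path for the location of the inputs of the map-reduce job.
        FileInputFormat.addInputPath(job, new Path(programArgs[0]));
        // TODO: Update the output path for the output directory of the map-reduce job.
        FileOutputFormat.setOutputPath(job, new Path(programArgs[1]));

        // Submit the job and wait for it to finish.
        job.waitForCompletion(true);
        // Submit and return immediately:
        // job.submit();
    }

}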

#
