hadoop - Getting rid of the default output directory completely - MapReduce

Question

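I have code that writes multiple outputs using org.apache.hadoop.mapreduce.lib.output.MultipleOutputs.

The reducer writes its results to pre-created locations, so I do not need the default o/p directory (which contains the _history and _SUCCESS directories).

I have to delete them every time before I run my job again.

So I removed the line TextOutputFormat.setOutputPath(job1,new Path(outputPath));. However, this gives me the (expected) error org.apache.hadoop.mapred.InvalidJobConfException: Output directory not set

Driver class: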

MultipleOutputs.addNamedOutput(job1, "path1", TextOutputFormat.class, Text.class,LongWritable.class);
MultipleOutputs.addNamedOutput(job1, "path2", TextOutputFormat.class, Text.class,LongWritable.class);
LazyOutputFormat.setOutputFormatClass(job1,TextOutputFormat.class);

Reducer class:

if(condition1)
    mos.write("path1", key, new LongWritable(value), path_list[0]);
else
    mos.write("path2", key, new LongWritable(value), path_list[1]);

Is there a workaround that avoids specifying a default output directory?

score 3 · Accepted Answer

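I don't think _SUCCESS is a directory, and the other one, history, resides inside the _logs directory.

First of all, TextOutputFormat.setOutputPath(job1,new Path(outputPath)); is important because, when the job runs, Hadoop uses this path as the working directory in which it creates temporary files and so on for the different tasks (the _temporary dir). This _temporary directory and its files are eventually deleted at the end of the job. The _SUCCESS file and the history directory are what actually remain under the working directory after the job completes successfully. The _SUCCESS file is a kind of flag indicating that the job actually ran successfully. Please see this link.

Your _SUCCESS file is created by the TextOutputFormat class you are actually using, which in turn uses the FileOutputCommitter class. The FileOutputCommitter class defines functions like these: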

 public static final String SUCCEEDED_FILE_NAME = "_SUCCESS";
/**
   * Delete the temporary directory, including all of the work directories.
   * This is called for all jobs whose final run state is SUCCEEDED
   * @param context the job's context.
   */
  public void commitJob(JobContext context) throws IOException {
    // delete the _temporary folder
    cleanupJob(context);
    // check if the o/p dir should be marked
    if (shouldMarkOutputDir(context.getConfiguration())) {
      // create a _success file in the o/p folder
      markOutputDirSuccessful(context);
    }
  }

// Mark the output dir of the job for which the context is passed.
  private void markOutputDirSuccessful(JobContext context)
  throws IOException {
    if (outputPath != null) {
      FileSystem fileSys = outputPath.getFileSystem(context.getConfiguration());
      if (fileSys.exists(outputPath)) {
        // create a file in the folder to mark it
        Path filePath = new Path(outputPath, SUCCEEDED_FILE_NAME);
        fileSys.create(filePath).close();
      }
    }
  }

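Since markOutputDirSuccessful() is private, you cannot override it directly; you would have to override commitJob() instead to bypass the SUCCEEDED_FILE_NAME creation step and achieve what you want.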
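The commitJob() quoted above only writes the marker when shouldMarkOutputDir(...) returns true. In the Hadoop versions I am aware of, that check reads the boolean property mapreduce.fileoutputcommitter.marksuccessfuljobs (default true), so switching it off in the driver may be enough without subclassing the committer. A minimal sketch, assuming that property name applies to your version; the class and method names are hypothetical:

```java
import org.apache.hadoop.conf.Configuration;

public class NoSuccessMarker {

    // Assumption: this is the key read by shouldMarkOutputDir() in your Hadoop version.
    private static final String MARK_SUCCESSFUL_JOBS =
            "mapreduce.fileoutputcommitter.marksuccessfuljobs";

    /** Call this on the Configuration before the Job is created from it. */
    public static Configuration disableSuccessMarker(Configuration conf) {
        // With the flag set to false, commitJob() skips markOutputDirSuccessful().
        conf.setBoolean(MARK_SUCCESSFUL_JOBS, false);
        return conf;
    }
}
```

This only suppresses the _SUCCESS file. The output directory itself still has to be set, and it is still used for the _temporary working files.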

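The next directory, _logs, is very important if you later want to use the hadoop HistoryViewer to get a report of how the job ran.

I think that when you use the same output directory as the input to another job, the file _SUCCESS and the directory _logs will be ignored because of a filter that is set in Hadoop.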
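To illustrate the filter mentioned above: as far as I know, FileInputFormat skips any input whose name starts with an underscore or a dot. The following is a plain-Java sketch of that rule, not Hadoop's own code; the class and method names are hypothetical.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class HiddenFileFilterSketch {

    /** Mirrors the hidden-file rule: names starting with "_" or "." are skipped. */
    public static boolean isVisible(String fileName) {
        return !fileName.startsWith("_") && !fileName.startsWith(".");
    }

    /** Returns only the names that a downstream job would treat as input. */
    public static List<String> visibleInputs(List<String> names) {
        return names.stream()
                .filter(HiddenFileFilterSketch::isVisible)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> listing = Arrays.asList(
                "_SUCCESS", "_logs", "part-r-00000", ".part-r-00000.crc", "path1-r-00000");
        // prints [part-r-00000, path1-r-00000]
        System.out.println(visibleInputs(listing));
    }
}
```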

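Also, when you define named outputs for MultipleOutputs, you can instead write to subdirectories inside the output path you passed to TextOutputFormat.setOutputPath(), and then use that path as the input to the next job you run.

I don't really see how _SUCCESS and _logs would bother you?

Thanks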

score 2 · Accepted Answer

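The question is old, but I am sharing an answer anyway.

This answer fits the scenario in the question well.

Define your OutputFormat to say that you are not expecting any output. You can do it this way:

job.setOutputFormat(NullOutputFormat.class);

Note that setOutputFormat is the old mapred API call on JobConf. With the new org.apache.hadoop.mapreduce API used in the question, the equivalent call on Job is job.setOutputFormatClass(NullOutputFormat.class).

Or

You can also use LazyOutputFormat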

import org.apache.hadoop.mapreduce.lib.output.LazyOutputFormat; 
LazyOutputFormat.setOutputFormatClass(job, TextOutputFormat.class);

Credit: @charlesmenguy

score 1 · Accepted Answer

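Which version of Hadoop are you running?

As a quick workaround, you could set the output location programmatically and call FileSystem.delete to remove it when the job finishes.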
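A minimal sketch of that workaround, assuming the new org.apache.hadoop.mapreduce API and assuming the reducer's real outputs live outside the default directory, since it is deleted recursively. The class name, method name and defaultOutput parameter are hypothetical:

```java
import java.io.IOException;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class CleanupDefaultOutput {

    /** Runs the job with a throwaway default output directory, then removes it. */
    public static boolean runAndCleanUp(Job job, Path defaultOutput)
            throws IOException, InterruptedException, ClassNotFoundException {
        // Hadoop still needs an output path for its _temporary working files.
        FileOutputFormat.setOutputPath(job, defaultOutput);

        boolean succeeded = job.waitForCompletion(true);

        // Remove the default directory (including _SUCCESS and _logs) once the job is done.
        FileSystem fs = defaultOutput.getFileSystem(job.getConfiguration());
        if (fs.exists(defaultOutput)) {
            fs.delete(defaultOutput, true); // true = recursive
        }
        return succeeded;
    }
}
```

Deleting after the run also means the directory is not there to block the next run, so no manual cleanup is needed between runs.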
