java - Hadoop: how to merge reducer outputs into a single file?

Question

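I know the "getmerge" shell command can do this.

But what should I do if I want to merge these outputs after the job through the HDFS API for Java?

What I actually want is a single merged file on HDFS.

The only thing I can think of is starting an additional job after that.

Thanks!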

score 10 · Accepted Answer

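"But what should I do if I want to merge these outputs after the job through the HDFS API for Java?"

This is a guess, since I haven't tried it myself, but I think the method you are looking for is FileUtil.copyMerge, which is the method FsShell invokes when you run the -getmerge command. FileUtil.copyMerge takes two FileSystem objects as arguments. FsShell uses FileSystem.getLocal to retrieve the destination FileSystem, but I don't see any reason you couldn't instead use Path.getFileSystem on the destination to get an OutputStream.

That said, I don't think it wins you very much: the merge still happens in the local JVM, so you aren't really saving much over -getmerge followed by -put.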
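
To make the "local JVM" point concrete, here is a minimal sketch of what a copyMerge-style merge amounts to: list the part files, sort them by name, and stream each one into a single output. It uses only java.nio on a local directory as a stand-in for HDFS so that it runs without Hadoop on the classpath. The class name PartFileMerger, the method mergeParts, and the "part-*" filter are assumptions made for illustration, not Hadoop's code. On HDFS the corresponding calls would be FileSystem.listStatus, FileSystem.open, and FileSystem.create, and every byte would still pass through the client. Note also that FileUtil.copyMerge exists in Hadoop 1.x and 2.x but was removed in Hadoop 3, where a loop like this is one way to replace it.

```java
import static java.nio.charset.StandardCharsets.UTF_8;

import java.io.IOException;
import java.io.OutputStream;
import java.nio.file.DirectoryStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class PartFileMerger {

    /** Concatenates every part-* file in srcDir, in name order, into dstFile. */
    static void mergeParts(Path srcDir, Path dstFile) throws IOException {
        List<Path> parts = new ArrayList<>();
        // HDFS analogue: FileSystem.listStatus(srcDir)
        try (DirectoryStream<Path> stream = Files.newDirectoryStream(srcDir, "part-*")) {
            for (Path part : stream) {
                parts.add(part);
            }
        }
        // Name order gives part-r-00000, part-r-00001, ...
        Collections.sort(parts);

        // HDFS analogue: FileSystem.create(dstFile)
        try (OutputStream out = Files.newOutputStream(dstFile)) {
            for (Path part : parts) {
                // HDFS analogue: FileSystem.open(part) followed by a stream copy
                Files.copy(part, out);
            }
        }
    }

    public static void main(String[] args) throws IOException {
        // Fake job output directory with two part files and a marker file.
        Path dir = Files.createTempDirectory("job-output");
        Files.write(dir.resolve("part-r-00001"), "b\t2\n".getBytes(UTF_8));
        Files.write(dir.resolve("part-r-00000"), "a\t1\n".getBytes(UTF_8));
        Files.write(dir.resolve("_SUCCESS"), new byte[0]);

        Path merged = Files.createTempFile("merged", ".txt");
        mergeParts(dir, merged);
        System.out.print(new String(Files.readAllBytes(merged), UTF_8));
    }
}
```

The whole merge is a sequential copy driven by one process, which is why doing it through the API saves little compared with -getmerge followed by -put.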

score 4 · Accepted Answer

You can get a single output file by configuring the job to use a single reducer in your code.

job.setNumReduceTasks(1);

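This will meet your requirement, but it is costly, because a single reduce task then has to process all of the map output.

Or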

Static method to execute a shell command. 
Covers most of the simple cases without requiring the user to implement the Shell interface.

Parameters:
env the map of environment key=value
cmd shell command to execute.
Returns:
the output of the executed command.

org.apache.hadoop.util.Shell.execCommand(String[])
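
As a rough illustration of this second option, the sketch below runs the merge as two shell commands from Java: -getmerge to a local file, then -put back to HDFS. It deliberately uses the JDK's ProcessBuilder instead of Shell.execCommand so that it compiles without Hadoop on the classpath. The class name ShellCommandSketch, the method run, and the three command-line arguments are hypothetical, and it assumes the hadoop CLI is on the PATH.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.util.List;

public class ShellCommandSketch {

    /** Runs a command, waits for it to finish, and returns its combined stdout and stderr. */
    static String run(List<String> command) throws IOException, InterruptedException {
        Process process = new ProcessBuilder(command)
                .redirectErrorStream(true) // fold stderr into stdout
                .start();
        String output = new String(process.getInputStream().readAllBytes(), StandardCharsets.UTF_8);
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command failed with exit code " + exitCode + ": " + output);
        }
        return output;
    }

    public static void main(String[] args) throws Exception {
        if (args.length != 3) {
            System.err.println("usage: ShellCommandSketch <hdfsOutputDir> <localTmpFile> <hdfsMergedFile>");
            return;
        }
        // Merge the part files into one local file, then upload that file to HDFS.
        System.out.print(run(List.of("hadoop", "fs", "-getmerge", args[0], args[1])));
        System.out.print(run(List.of("hadoop", "fs", "-put", args[1], args[2])));
    }
}
```

With Hadoop on the classpath, Shell.execCommand could take the place of run here, passing the same command words as its String arguments.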
