hadoop - Running Hadoop MapReduce: is it possible to call external executables outside of HDFS?

Question

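Within my mapper, I'd like to call external software that is installed on the worker nodes, outside of HDFS. Is this possible? What is the best way to do it?

I understand that this may take away some of the advantages/scalability of MapReduce, but I'd like to both work with data in HDFS and call compiled/installed external software from within my mapper to process some of that data.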

score 5 · Accepted Answer

Mappers (and reducers) are like any other process on the box: as long as the TaskTracker user has permission to run the executable, there is no problem doing so. There are a few ways to call external processes, but since we are already in Java, ProcessBuilder seems a logical place to start.
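As a rough illustration of the ProcessBuilder approach, here is a minimal sketch of a helper that a mapper's map() method could call. The class name ExternalTool and the method run are hypothetical names chosen for this example, and the sketch assumes the executable is on the PATH of every worker node and writes its result to standard output.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of Hadoop): runs an external command and
// returns the lines it printed.
final class ExternalTool {

    static List<String> run(String... command)
            throws IOException, InterruptedException {
        ProcessBuilder builder = new ProcessBuilder(command);
        // Merge stderr into stdout so a single reader drains both streams
        // and the child cannot block on a full stderr pipe.
        builder.redirectErrorStream(true);
        Process process = builder.start();

        List<String> lines = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(process.getInputStream(),
                                      StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                lines.add(line);
            }
        }

        // Treat a non-zero exit code as a failure so the task can fail or retry.
        int exitCode = process.waitFor();
        if (exitCode != 0) {
            throw new IOException("Command " + Arrays.toString(command)
                    + " exited with code " + exitCode);
        }
        return lines;
    }
}
```

Inside a mapper, a call such as ExternalTool.run("/path/to/tool", value.toString()) would return the tool's output, which can then be written to the job context. The path here is a placeholder. Starting one process per input record can be slow, so for large inputs it may be worth batching records or keeping one long-running child process per task.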

EDIT: Just found that Hadoop has a class explicitly for this purpose: http://hadoop.apache.org/common/docs/current/api/org/apache/hadoop/util/Shell.html

score 0 · Accepted Answer

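This is certainly doable. You may find it best to use Hadoop Streaming. As it says on that website:

Hadoop streaming is a utility that comes with the Hadoop distribution. The utility allows you to create and run map/reduce jobs with any executable or script as the mapper and/or the reducer.

I tend to start with the external code inside Hadoop Streaming. Depending on your language, there are likely many good examples of how to use it with Streaming; once you are inside your language of choice, you can usually pipe data out to another program if needed. I have had several layers of programs in different languages working together nicely, with no extra effort beyond getting the outer layer working with Hadoop Streaming, just as if I were running them on a normal Linux box.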
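To make the Streaming contract concrete: a streaming mapper is simply a program that reads input lines from standard input and writes one record per line to standard output, where by default the text before the first tab is the key and the rest is the value. The sketch below is an illustrative word-count style mapper; the class name StreamingWordMapper is hypothetical, and Java is used only to match the rest of this thread, since any executable or script would work the same way.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.PrintStream;
import java.io.Reader;
import java.nio.charset.StandardCharsets;

// Hypothetical streaming mapper: emits "<word>\t1" for every word it reads.
final class StreamingWordMapper {

    // Kept separate from main() so the logic can be exercised without a cluster.
    static void map(Reader input, PrintStream output) throws IOException {
        BufferedReader reader = new BufferedReader(input);
        String line;
        while ((line = reader.readLine()) != null) {
            for (String word : line.trim().split("\\s+")) {
                if (!word.isEmpty()) {
                    // Key and value are separated by a tab, one record per line.
                    output.print(word + "\t1\n");
                }
            }
        }
        output.flush();
    }

    public static void main(String[] args) throws IOException {
        map(new InputStreamReader(System.in, StandardCharsets.UTF_8), System.out);
    }
}
```

Such a program is passed to the streaming jar through its -mapper option, along with -input and -output paths; the location of the streaming jar differs between Hadoop distributions. Because the mapper is an ordinary process, it can in turn start or pipe data to other installed programs on the worker node.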
