java - Accessing files from other file systems along with HDFS files in a Hadoop MapReduce application

Question

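I know we can launch a map-reduce job from a normal Java application. In my case, the map-reduce job has to process files on HDFS as well as files on other file systems. Is it possible in Hadoop to access files from another file system while also using the files on HDFS?

Basically, I have one large file that I want to put in HDFS for parallel computation, and I then want to compare the blocks of that file with some other files (which I do not want to put in HDFS, because each of them needs to be read in one go as a full-length file).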

score 2 · Accepted Answer

It should be possible to access a non-HDFS file system from mapper/reducer tasks, just as from any other task. One thing to note: if there are, say, 1K mappers and each of them tries to open the non-HDFS file, this can become a bottleneck, depending on the type of the external file system. The same applies to mappers pulling data from a database.

score 1 · Accepted Answer

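You can use the distributed cache to distribute the files to your mappers; they can open and read the files in their configure() method (don't read them in map(), since it is called many times).

Edit

To access files from the local file system in a map-reduce job, you can add those files to the distributed cache when you set up the job configuration.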

JobConf job = new JobConf();
DistributedCache.addCacheFile(new URI("/myapp/lookup.dat#lookup.dat"), job);

The MapReduce framework will make sure these files are accessible to your mappers.

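public void configure(JobConf job) {
    try {
        // Get the local paths of the cached archives/files
        Path[] localFiles = DistributedCache.getLocalCacheFiles(job);

        // open, read and store for use in the map phase.
    } catch (IOException e) {
        // getLocalCacheFiles declares IOException; configure() cannot rethrow it as checked
        throw new RuntimeException("Could not read distributed cache files", e);
    }
}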

And delete the files after your job has finished.
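
The "open, read and store" step in configure() could be filled in roughly as follows. This is only a minimal sketch that uses plain java.io/java.nio, so it runs without Hadoop on the classpath. The class name LookupTable, its method names, and the tab-separated "key<TAB>value" layout of lookup.dat are all assumptions made for illustration; none of them come from the answer above or from the Hadoop API.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.HashMap;
import java.util.Map;

// Hypothetical helper: loads a small lookup file once so that map() can reuse it.
public class LookupTable {
    private final Map<String, String> entries = new HashMap<>();

    // Intended to be called once per task, e.g. from configure().
    // Assumes one "key<TAB>value" pair per line (an assumed file format).
    public void load(String localPath) throws IOException {
        try (BufferedReader reader =
                 Files.newBufferedReader(Paths.get(localPath), StandardCharsets.UTF_8)) {
            String line;
            while ((line = reader.readLine()) != null) {
                int tab = line.indexOf('\t');
                if (tab < 0) {
                    continue; // skip lines that have no tab separator
                }
                entries.put(line.substring(0, tab), line.substring(tab + 1));
            }
        }
    }

    // In-memory lookup, cheap enough to call from every map() invocation.
    public String get(String key) {
        return entries.get(key);
    }

    public int size() {
        return entries.size();
    }
}
```

In a mapper, one would presumably keep a LookupTable field, call something like table.load(localFiles[0].toString()) inside configure(), and then only call table.get(...) from map(), so the file is read once per task rather than once per record.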
