java - Reading a file in the main function - Hadoop

Question

I am trying to read a file in the main method of my Hadoop job, not in the mapper or reducer. I am using Amazon EMR with a custom JAR.

The command-line arguments are: -files s3://[path]#source.xml

Inside the main function I am doing:

File file = new File("source.xml");

I don't know whether the distributed cache is available in the main function or only in the mapper/reducer functions. Do I need to use the DistributedCache API?

The command line that AWS is executing:

hadoop jar /mnt/var/lib/hadoop/steps/s-1YBXTPYJ2YK44/JobTeste_SomenteLeitura.jar -files s3://stoneagebrasil/TesteBVS/sources.xml

How can I do this?

score 2 · Accepted Answer

Try:

FileSystem fs = FileSystem.get(configuration);
Path path = new Path("test.txt");

To read the file:

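// try-with-resources closes the reader (and the underlying stream) when done
try (BufferedReader br = new BufferedReader(new InputStreamReader(fs.open(path)))) {
    String line;
    // Print each line until the end of the file
    while ((line = br.readLine()) != null) {
        System.out.println(line);
    }
}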

score 0 · Accepted Answer

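So far, I have found that it is not possible to read a file from the distributed cache inside the Hadoop driver (the main function). This is because the file is only distributed (copied to the slave nodes) after the job has been started.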

The solution is to read the file directly from S3.
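
A minimal sketch of that approach, assuming the driver opens the S3 object through Hadoop's FileSystem API (for example `path.getFileSystem(conf).open(path)`, where `path` is a `Path` for the s3:// URI and `conf` is the job's `Configuration`) and reads the stream before submitting the job. The names `DriverFileReader` and `readLines` are hypothetical, not part of Hadoop. The helper takes a plain `InputStream` so it can be tried without a cluster; the Hadoop call appears only as a comment, and the UTF-8 encoding is an assumption about the file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for the driver (main function); not part of Hadoop.
public class DriverFileReader {

    // Reads every line from the stream and closes it when done.
    // UTF-8 is an assumption about the file's encoding.
    public static List<String> readLines(InputStream in) throws IOException {
        List<String> lines = new ArrayList<>();
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                lines.add(line);
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // In the real driver the stream would come from Hadoop, e.g.:
        //   Path path = new Path("s3://[path]");
        //   InputStream in = path.getFileSystem(conf).open(path);
        // Here an in-memory stream stands in for the S3 object.
        InputStream in = new ByteArrayInputStream(
                "<source>\n</source>\n".getBytes(StandardCharsets.UTF_8));
        for (String line : readLines(in)) {
            System.out.println(line);
        }
    }
}
```

Read this way, the driver does not depend on the -files option at all; that option would only matter for copies the mappers or reducers read.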
