hadoop - Reading a parameter file in Amazon Elastic MapReduce and S3

Question

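I am trying to run my Hadoop program on Amazon Elastic MapReduce. My program takes an input file from the local file system that contains the parameters the program needs to run. However, because the file is normally read from the local file system with FileInputStream, the task fails when it runs in the AWS environment, with an error saying the parameter file was not found. Note that I have already uploaded the file to Amazon S3. How can I solve this problem? Thanks. Below is the code I use to read the parameter file and, from it, the parameters.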

            FileInputStream fstream = new FileInputStream(path);
            DataInputStream datain = new DataInputStream(fstream);
            BufferedReader br = new BufferedReader(new InputStreamReader(datain));

            String[] args = new String[7];

            int i = 0;
            String strLine;
            while ((strLine = br.readLine()) != null) {
                args[i++] = strLine;
            }

score 1 · Accepted Answer

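If you must read the file from the local file system, you can configure your EMR job to run with a bootstrap action. In that action, simply copy the file from S3 to a local file using s3cmd or a similar tool.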

You could also read the file through the Hadoop FileSystem class, since I'm fairly sure EMR supports this kind of direct access. For example:

FileSystem fs = FileSystem.get(new URI("s3://my.bucket.name/"), conf);
DataInputStream in = fs.open(new Path("/my/parameter/file"));
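
Whichever route you take, you end up holding an InputStream, so the parsing code does not need to know whether the bytes come from a local file or from S3. Here is a minimal sketch of that idea, assuming one parameter per line as in the question. The class name ParamReader and the method readParams are hypothetical helpers, not part of Hadoop, and the UTF-8 encoding is an assumption.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper (not a Hadoop class): reads one parameter per line
// from any InputStream, e.g. a FileInputStream or the stream from fs.open().
final class ParamReader {
    private ParamReader() {}

    static List<String> readParams(InputStream in) {
        // A list avoids overflowing a fixed-size array if the file grows.
        List<String> params = new ArrayList<>();
        // try-with-resources closes the reader and the underlying stream.
        try (BufferedReader br = new BufferedReader(
                new InputStreamReader(in, StandardCharsets.UTF_8))) {
            String line;
            while ((line = br.readLine()) != null) {
                params.add(line);
            }
        } catch (IOException e) {
            // Rethrow unchecked so callers without a throws clause can use it.
            throw new UncheckedIOException(e);
        }
        return params;
    }
}
```

With this in place, the snippet above would become something like ParamReader.readParams(fs.open(new Path("/my/parameter/file"))), and the original local version would pass new FileInputStream(path) instead.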

score 0 · Accepted Answer

You can add this file to the distributed cache as follows:

...
String s3FilePath = args[0];
DistributedCache.addCacheFile(new URI(s3FilePath), conf);
...

Later, in configure() of your mapper/reducer, you can do the following:

...
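Path s3FilePath;
@Override
public void configure(JobConf job) {
    try {
        // Local copy of the first file registered in the distributed cache
        s3FilePath = DistributedCache.getLocalCacheFiles(job)[0];
        FileInputStream fstream = new FileInputStream(s3FilePath.toString());
        ...
    } catch (IOException e) {
        // configure() cannot throw checked exceptions, so wrap and rethrow
        throw new RuntimeException("Could not open cached parameter file", e);
    }
}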

score 0 · Accepted Answer

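I haven't tried Amazon Elastic MapReduce, but this looks like a classic use of the distributed cache. You add the file to the cache with the -files option (if you implement Tool/ToolRunner) or with the job.addCacheFile(URI uri) method, and then access it as if it existed locally.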
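
To illustrate the "as if it existed locally" part, here is a rough sketch of the task-side read using only the JDK. It assumes the cached file is exposed in the task's working directory under its base name, which is my understanding of how -files behaves. The file name params.txt, the class LocalParamFile and the UTF-8 encoding are made up for the example.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.List;

// Hypothetical helper: loads a parameter file that the framework has
// already placed on the local disk of the task (assumption, see above).
final class LocalParamFile {
    private LocalParamFile() {}

    static List<String> load(String localName) {
        Path p = Paths.get(localName);
        if (!Files.isReadable(p)) {
            // Fail with the absolute path so a missing cache file is easy to spot.
            throw new IllegalStateException(
                "Parameter file not found locally: " + p.toAbsolutePath());
        }
        try {
            // One parameter per line, as in the question.
            return Files.readAllLines(p, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

In a mapper's setup code this would be called as LocalParamFile.load("params.txt"), after submitting the job with -files pointing at the file.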
