java - How do I get the file name and file content as the key/value input to map when running a Hadoop MapReduce job?

Question

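I am writing a program to analyze PDF, DOC, and DOCX files. These files are stored in HDFS.

When I start my MapReduce job, I want the map function to receive the file name as the key and the binary content as the value. I then want to create a stream reader that I can pass to the PDF parser library. How can I make the key/value pairs of the map phase be file name/file content?

I am using Hadoop 0.20.2.

This is the older code that starts a job: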

public static void main(String[] args) throws Exception {
 JobConf conf = new JobConf(PdfReader.class);
 conf.setJobName("pdfreader");

 conf.setOutputKeyClass(Text.class);
 conf.setOutputValueClass(IntWritable.class);

 conf.setMapperClass(Map.class);
 conf.setReducerClass(Reduce.class);

 conf.setInputFormat(TextInputFormat.class);
 conf.setOutputFormat(TextOutputFormat.class);

 FileInputFormat.setInputPaths(conf, new Path(args[0]));
 FileOutputFormat.setOutputPath(conf, new Path(args[1]));

 JobClient.runJob(conf);
}

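I know there are other input format types, but is there one that does exactly what I want? I find the documentation quite vague. If one is available, what should the map function's input types be?

Thanks in advance!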

score 8 · Accepted Answer

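The solution is to create your own FileInputFormat class that does this. You can get the name of the input file from the FileSplit that this FileInputFormat receives (getPath). Make sure to override isSplitable in your FileInputFormat so that it always returns false.

You will also need a custom RecordReader that returns the entire file as a single "record" value.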
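A minimal sketch of what such an input format and record reader could look like, written against the old org.apache.hadoop.mapred API that the question's JobConf code uses. The class names WholeFileInputFormat and WholeFileRecordReader are illustrative choices, not classes that ship with Hadoop 0.20.2, and the sketch assumes the Hadoop jars are on the classpath.

```java
import java.io.IOException;

import org.apache.hadoop.fs.FSDataInputStream;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.BytesWritable;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileInputFormat;
import org.apache.hadoop.mapred.FileSplit;
import org.apache.hadoop.mapred.InputSplit;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.RecordReader;
import org.apache.hadoop.mapred.Reporter;

/** Illustrative input format: one record per file, key = file name, value = file bytes. */
public class WholeFileInputFormat extends FileInputFormat<Text, BytesWritable> {

    @Override
    protected boolean isSplitable(FileSystem fs, Path filename) {
        return false; // never split, so each map call sees a complete file
    }

    @Override
    public RecordReader<Text, BytesWritable> getRecordReader(
            InputSplit split, JobConf job, Reporter reporter) throws IOException {
        return new WholeFileRecordReader((FileSplit) split, job);
    }

    /** Illustrative reader that returns the whole file as a single record. */
    static class WholeFileRecordReader implements RecordReader<Text, BytesWritable> {
        private final FileSplit split;
        private final JobConf job;
        private boolean processed = false;

        WholeFileRecordReader(FileSplit split, JobConf job) {
            this.split = split;
            this.job = job;
        }

        @Override
        public boolean next(Text key, BytesWritable value) throws IOException {
            if (processed) {
                return false; // the single record has already been returned
            }
            if (split.getLength() > Integer.MAX_VALUE) {
                throw new IOException("File too large to buffer: " + split.getPath());
            }
            Path path = split.getPath();
            byte[] contents = new byte[(int) split.getLength()];
            FileSystem fs = path.getFileSystem(job);
            FSDataInputStream in = fs.open(path);
            try {
                IOUtils.readFully(in, contents, 0, contents.length);
            } finally {
                IOUtils.closeStream(in);
            }
            key.set(path.getName());
            value.set(contents, 0, contents.length);
            processed = true;
            return true;
        }

        @Override
        public Text createKey() { return new Text(); }

        @Override
        public BytesWritable createValue() { return new BytesWritable(); }

        @Override
        public long getPos() { return processed ? split.getLength() : 0; }

        @Override
        public float getProgress() { return processed ? 1.0f : 0.0f; }

        @Override
        public void close() { /* the stream is closed inside next() */ }
    }
}
```

With this in place, the driver would call conf.setInputFormat(WholeFileInputFormat.class), and the map function's input types would be Text for the key and BytesWritable for the value.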

Be careful when handling files that are too big. You effectively load the entire file into RAM, and by default each task's child JVM gets a heap of only 200 MB (mapred.child.java.opts defaults to -Xmx200m).
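One way to act on this warning is to refuse to buffer more than a fixed number of bytes. The helper below is a hypothetical, plain-Java sketch (BoundedReader and readAtMost are made-up names, not Hadoop APIs); the limit is something you would choose based on the heap you give your tasks.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;

/** Hypothetical helper: reads a stream fully, but never buffers more than maxBytes. */
final class BoundedReader {
    private BoundedReader() {}

    static byte[] readAtMost(InputStream in, int maxBytes) throws IOException {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] chunk = new byte[8192];
        long total = 0; // long, so the running sum cannot overflow
        int n;
        while ((n = in.read(chunk)) != -1) {
            total += n;
            if (total > maxBytes) {
                // Fail before the oversized chunk is added to the buffer.
                throw new IOException("Input exceeds limit of " + maxBytes + " bytes");
            }
            out.write(chunk, 0, n);
        }
        return out.toByteArray();
    }
}
```

A record reader or mapper could call BoundedReader.readAtMost(stream, limit) and skip or log the file when the IOException is thrown, instead of letting the task die with an OutOfMemoryError.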

score 1 · Accepted Answer

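As an alternative to your approach, you can add the binary files directly to HDFS. Then create an input file that contains the DFS paths of all the binary files. This can be done dynamically using Hadoop's FileSystem class. Finally, create a mapper that processes the input by opening an input stream for each path, again using FileSystem.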
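A rough sketch of the mapper side of this idea, using the old org.apache.hadoop.mapred API from the question. It assumes the job input is a text file with one path per line, read with TextInputFormat. PathListMapper is a hypothetical name, and counting bytes stands in for the real parsing step.

```java
import java.io.IOException;
import java.io.InputStream;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IOUtils;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.MapReduceBase;
import org.apache.hadoop.mapred.Mapper;
import org.apache.hadoop.mapred.OutputCollector;
import org.apache.hadoop.mapred.Reporter;

/** Hypothetical mapper: each input line is the path of a binary file to open and process. */
public class PathListMapper extends MapReduceBase
        implements Mapper<LongWritable, Text, Text, IntWritable> {

    private JobConf job;

    @Override
    public void configure(JobConf job) {
        this.job = job; // kept so map() can resolve the FileSystem for each path
    }

    @Override
    public void map(LongWritable offset, Text line,
                    OutputCollector<Text, IntWritable> output, Reporter reporter)
            throws IOException {
        String pathString = line.toString().trim();
        if (pathString.isEmpty()) {
            return; // ignore blank lines in the path list
        }
        Path path = new Path(pathString);
        FileSystem fs = path.getFileSystem(job);
        InputStream in = fs.open(path);
        try {
            // Placeholder for real work: a parser would consume `in` here.
            int size = 0;
            byte[] chunk = new byte[8192];
            int n;
            while ((n = in.read(chunk)) != -1) {
                size += n;
                reporter.progress(); // tell the framework the task is still alive
            }
            output.collect(new Text(path.getName()), new IntWritable(size));
        } finally {
            IOUtils.closeStream(in);
        }
    }
}
```

Because the file is streamed rather than buffered, this variant avoids holding the whole file in memory, at the cost of losing data locality: the map task runs where the path list lives, not necessarily where the binary file's blocks are.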

score 1 · Accepted Answer

You can use WholeFileInputFormat (https://code.google.com/p/hadoop-course/source/browse/HadoopSamples/src/main/java/mr/wholeFile/?r=3)

In the mapper, you can get the name of the file like this:

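// Needs java.util.Arrays; FileSplit is org.apache.hadoop.mapreduce.lib.input.FileSplit (new API).
public void map(NullWritable key, BytesWritable value, Context context)
        throws IOException, InterruptedException {
    Path filePath = ((FileSplit) context.getInputSplit()).getPath();
    String fileNameString = filePath.getName();

    // getBytes() returns the backing buffer, which can be longer than the data.
    // Only the first getLength() bytes are valid, so copy exactly that many.
    byte[] fileContent = Arrays.copyOf(value.getBytes(), value.getLength());
}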
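Since the question asks for a stream to hand to a PDF parser, here is a small plain-Java sketch of wrapping the value's bytes without copying them. ContentStreams and asStream are hypothetical names; in a mapper you would pass value.getBytes() and value.getLength() from the BytesWritable.

```java
import java.io.ByteArrayInputStream;
import java.io.InputStream;

/** Hypothetical helper: exposes the valid part of a buffer as a stream. */
final class ContentStreams {
    private ContentStreams() {}

    /**
     * Wraps only the first validLength bytes of buffer, so any padding
     * at the end of the backing array is never seen by the consumer.
     */
    static InputStream asStream(byte[] buffer, int validLength) {
        return new ByteArrayInputStream(buffer, 0, validLength);
    }
}
```

For example, InputStream in = ContentStreams.asStream(value.getBytes(), value.getLength()); gives a stream over exactly the file's content, which can then be passed to whatever parser library you use.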
