I want to parse PDF files in my Hadoop 2.2.0 program, and I found this; following what it says, I now have these three classes:
PDFWordCount: the main class containing the map and reduce functions. (It is just like the native Hadoop wordcount example, except that instead of TextInputFormat it uses my PDFInputFormat class.)
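For reference, here is roughly what that class looks like. The mapper and reducer bodies below are copied from the standard wordcount example rather than from my exact file, so treat this as a sketch; the only PDF-specific line is the setInputFormatClass call:

```java
import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class PDFWordCount {

    // Word-count mapper, as in the native example; its (key, value) pairs
    // come from PDFRecordReader instead of the default LineRecordReader.
    public static class TokenizerMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {
        private final static IntWritable one = new IntWritable(1);
        private final Text word = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            StringTokenizer itr = new StringTokenizer(value.toString());
            while (itr.hasMoreTokens()) {
                word.set(itr.nextToken());
                context.write(word, one);
            }
        }
    }

    // Standard summing reducer from the wordcount example.
    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        @Override
        protected void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable val : values) {
                sum += val.get();
            }
            context.write(key, new IntWritable(sum));
        }
    }

    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = Job.getInstance(conf, "pdf word count");
        job.setJarByClass(PDFWordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        // The one PDF-specific line: my format instead of TextInputFormat.
        job.setInputFormatClass(PDFInputFormat.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
    }
}
```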
PDFRecordReader extends RecordReader&lt;LongWritable, Text&gt;: this is where the main work is done. In particular, I put my initialize method here for more explanation:
```java
public void initialize(InputSplit genericSplit, TaskAttemptContext context)
        throws IOException, InterruptedException {
    System.out.println("initialize");
    System.out.println(genericSplit.toString());
    FileSplit split = (FileSplit) genericSplit;
    System.out.println("filesplit convertion has been done");
    final Path file = split.getPath();
    Configuration conf = context.getConfiguration();
    conf.getInt("mapred.linerecordreader.maxlength", Integer.MAX_VALUE);
    FileSystem fs = file.getFileSystem(conf);
    System.out.println("fs has been opened");
    start = split.getStart();
    end = start + split.getLength();
    System.out.println("going to open split");
    FSDataInputStream filein = fs.open(split.getPath());
    System.out.println("going to load pdf");
    PDDocument pd = PDDocument.load(filein);
    System.out.println("pdf has been loaded");
    PDFTextStripper stripper = new PDFTextStripper();
    in = new LineReader(new ByteArrayInputStream(
            stripper.getText(pd).getBytes("UTF-8")));
    start = 0;
    this.pos = start;
    System.out.println("init has finished");
}
```
(You can see my System.out.println calls for debugging. This method fails at converting genericSplit to FileSplit. The last thing I see in the console is:

hdfs://localhost:9000/in:0+9396432

which is genericSplit.toString(), i.e. the split's path, start offset, and length.)
PDFInputFormat extends FileInputFormat&lt;LongWritable, Text&gt;: this just creates a new PDFRecordReader in its createRecordReader method.
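Spelled out, that class is essentially just the following (a minimal sketch of what the description above implies, using the standard org.apache.hadoop.mapreduce types):

```java
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.InputSplit;
import org.apache.hadoop.mapreduce.RecordReader;
import org.apache.hadoop.mapreduce.TaskAttemptContext;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;

public class PDFInputFormat extends FileInputFormat<LongWritable, Text> {
    @Override
    public RecordReader<LongWritable, Text> createRecordReader(
            InputSplit split, TaskAttemptContext context) {
        // Hand each split to the PDF-aware reader; the framework calls
        // its initialize() with this split afterwards.
        return new PDFRecordReader();
    }
}
```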
I want to know what my mistake is. Do I need any extra classes?