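I am trying to parse a PDF file with Apache Tika and extract its metadata and text, but I still don't get the result I want. I'm sure it's a silly mistake, but I can't see it. The file d.pdf exists and is located in the project's root folder. The imports are correct as well.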
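import java.io.File;
import java.io.IOException;
import java.io.InputStream;

import org.apache.tika.exception.TikaException;
import org.apache.tika.io.TikaInputStream;
import org.apache.tika.metadata.Metadata;
import org.apache.tika.parser.AutoDetectParser;
import org.apache.tika.parser.ParseContext;
import org.apache.tika.parser.Parser;
import org.apache.tika.sax.BodyContentHandler;
import org.xml.sax.SAXException;

public class MultiParse {
    public static void main(final String[] args)
            throws IOException, SAXException, TikaException {
        Parser parser = new AutoDetectParser();
        File f = new File("d.pdf");
        System.out.println("------------ Parsing a PDF:");
        extractFromFile(parser, f);
    }

    private static void extractFromFile(final Parser parser, final File f)
            throws IOException, SAXException, TikaException {
        // Allow up to 10,000,000 characters of extracted body text.
        BodyContentHandler handler = new BodyContentHandler(10000000);
        Metadata metadata = new Metadata();
        try (InputStream is = TikaInputStream.get(f)) {
            parser.parse(is, handler, metadata, new ParseContext());
        }
        // Print every metadata field the parser populated.
        for (String name : metadata.names()) {
            System.out.println(name + ":\t" + metadata.get(name));
        }
    }
}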
Output: no errors, but... not much either :(
------------ Parsing a PDF:
Content-Type: application/pdf