pdf - Extracting data from a PDF

Question

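How can I extract data from PDF files, mainly data tables and the like? Are there any free or open-source tools that can do this directly? I have to process a large number of files.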

score 0 · Accepted Answer

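Yes, to some extent you can extract text from PDF files using the Lucene 3.x library and PDFBox 0.7.

However, this kind of extraction will not give you images, and some formatting comes out as binary or garbage characters.

But you can get the plain text: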

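import java.io.File;
import java.io.FileInputStream;
// PDFBox 0.7.x package names; Apache PDFBox 1.x uses org.apache.pdfbox.* instead
import org.pdfbox.pdfparser.PDFParser;
import org.pdfbox.pdmodel.PDDocument;
import org.pdfbox.util.PDFTextStripper;

// Open the PDF and parse it into a document object
File f = new File("filename");
FileInputStream fis = new FileInputStream(f);
PDFParser parser = new PDFParser(fis);
parser.parse();
PDDocument pd = parser.getPDDocument();

// Extract the plain text, then release the document
PDFTextStripper pst = new PDFTextStripper();
String pdftext = pst.getText(pd);
pd.close();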

For this you need to download two JAR files: 1) lucene-core-3.0.3 jar 2) pdfbox-0.7.3 jar

This should help, don't worry.
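Since the question is about tables, here is a rough sketch of one way to turn extracted plain text into rows and cells. It assumes the columns end up separated by runs of two or more spaces, which depends on the PDF and the extractor and may not hold for your files. The class and method names (`TableRows`, `splitRow`, `parseTable`) and the sample text are made up for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class TableRows {
    // Split one line of extracted text into cells, treating a run of
    // two or more whitespace characters as a column gap (an assumption).
    public static List<String> splitRow(String line) {
        return Arrays.asList(line.trim().split("\\s{2,}"));
    }

    // Split the whole extracted text into rows, skipping blank lines.
    public static List<List<String>> parseTable(String text) {
        List<List<String>> rows = new ArrayList<>();
        for (String line : text.split("\\R")) {
            if (!line.trim().isEmpty()) {
                rows.add(splitRow(line));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        // Hypothetical sample standing in for the extracted text
        String pdftext = "Item    Qty   Price\nWidget    2    9.50\n";
        for (List<String> row : parseTable(pdftext)) {
            System.out.println(String.join(" | ", row));
        }
    }
}
```

Cells that contain wide gaps themselves, or columns separated by a single space, will be split incorrectly, so check the result against a few real files first.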

score 0 · Accepted Answer

For basic text extraction, if you have access to command-line utilities, try pdftotext or pdftohtml. You can also use the strings command.
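If you have a lot of files, you could drive pdftotext from a small program. The following is a minimal sketch, assuming `pdftotext` is installed and on the PATH; it passes the `-layout` option, which asks pdftotext to keep the physical layout and tends to help with tables. The class name `PdfToTextBatch` and its helper methods are hypothetical, not part of any library.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

public class PdfToTextBatch {
    // Command line for one file: pdftotext -layout input.pdf output.txt
    public static List<String> buildCommand(Path pdf, Path txt) {
        return Arrays.asList("pdftotext", "-layout", pdf.toString(), txt.toString());
    }

    // Output path next to the input, with the .pdf suffix replaced by .txt
    public static Path textPathFor(Path pdf) {
        String name = pdf.getFileName().toString();
        String base = name.toLowerCase().endsWith(".pdf")
                ? name.substring(0, name.length() - 4) : name;
        return pdf.resolveSibling(base + ".txt");
    }

    // Run pdftotext on one file and return the process exit code
    public static int convert(Path pdf) throws IOException, InterruptedException {
        Process p = new ProcessBuilder(buildCommand(pdf, textPathFor(pdf)))
                .inheritIO()
                .start();
        return p.waitFor();
    }

    public static void main(String[] args) throws Exception {
        // Directory to scan: first argument, or the current directory
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        List<Path> pdfs;
        try (Stream<Path> files = Files.list(dir)) {
            pdfs = files.filter(f -> f.toString().toLowerCase().endsWith(".pdf"))
                        .collect(Collectors.toList());
        }
        for (Path pdf : pdfs) {
            int code = convert(pdf);
            System.out.println(pdf + " -> exit code " + code);
        }
    }
}
```

Run it as `java PdfToTextBatch /path/to/pdfs`; each PDF gets a `.txt` file next to it. A non-zero exit code means pdftotext could not process that file.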
