pdf - Getting text positions from a PDF

Question

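I want to know the position of every word on a PDF page. I have been searching online but could not find anything. Can anyone tell me which library I should use (preferably on the Java platform)?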

score 0 · Accepted Answer

Take a look at this tutorial: http://www.luigimicco.altervista.org/doku.php/en/doc/pdf_structure

Basically, using PDFBox, you can get a page's content stream with:

InputStream is = yourPDFDocument.getDocumentCatalog().getPages().get(yourPage).getContents();

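Then search the stream for the X Y Td lines you are looking for. Td is the content-stream operator that moves the start of the text line by the offset (X, Y).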
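As a rough illustration of that search, here is a minimal sketch that scans content-stream text for X Y Td operators and accumulates the offsets inside each BT ... ET text object. The class TdScanner and the method textLineOrigins are hypothetical names, not part of PDFBox. The sketch assumes the stream has already been read into a String in decoded form, and it ignores Tm, TD, T*, the current transformation matrix, and operator-like text inside string literals, so the values are text-space line origins rather than final page coordinates.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper (not a PDFBox class): finds "X Y Td" in decoded content-stream text.
class TdScanner {

    // A simple PDF-style number such as 72, -14, 50.5 or .5
    private static final String NUM = "[+-]?\\d*\\.?\\d+";

    // Matches either a BT operator (start of a text object) or "X Y Td".
    private static final Pattern TOKEN = Pattern.compile(
            "(?<![\\w.+-])(?:(BT)\\b|(" + NUM + ")\\s+(" + NUM + ")\\s+Td\\b)");

    // Returns the accumulated text-line origin {x, y} after each Td operator.
    static List<double[]> textLineOrigins(String content) {
        List<double[]> origins = new ArrayList<>();
        double x = 0;
        double y = 0;
        Matcher m = TOKEN.matcher(content);
        while (m.find()) {
            if (m.group(1) != null) {
                // BT resets the text line matrix, so offsets start again from (0, 0).
                x = 0;
                y = 0;
                continue;
            }
            // Td moves relative to the start of the current line.
            x += Double.parseDouble(m.group(2));
            y += Double.parseDouble(m.group(3));
            origins.add(new double[] {x, y});
        }
        return origins;
    }

    public static void main(String[] args) {
        // Made-up sample stream, for illustration only.
        String sample = "BT\n/F1 12 Tf\n72 700 Td\n(Hello) Tj\n0 -14 Td\n(World) Tj\nET\n";
        for (double[] p : textLineOrigins(sample)) {
            System.out.println(p[0] + ", " + p[1]);
        }
    }
}
```

In a real document the text matrix (Tm) and the page's transformation matrix also affect where text ends up, so treat these numbers as a starting point for inspection rather than exact word positions.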

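I am fairly sure there is a simpler way to do this, but since I worked a lot with content streams on one project, this is the only way I know.
Search the PDFBox Javadocs for more details!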

I hope this helps :)

score 0 · Accepted Answer

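You can use Textricator, but unfortunately its documentation is not maintained, so it is hard to get its more interesting features working. To just see text positions, though, you can use its simple text mode.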

./textricator.bat text --pages=2 xxx.pdf

# output is a long list of CSV properties for the document, including the OCR read text and the x,y coordinates of it.
