pdfbox - How to extract paragraphs from a PDF file and store their positions?

Question

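I am going to use the PDFBox library to extract the content of a PDF file. The content has to be processed paragraph by paragraph, and for each paragraph I need its position for later processing. With the following code I can extract the entire content of the input PDF: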

PDDocument doc = PDDocument.load(file);
PDFTextStripper stripper = new PDFTextStripper();
String txt = stripper.getText(doc);
doc.close();

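I have two problems:

I don't know how to extract the content paragraph by paragraph.
I don't know how to store the position of a paragraph for later processing (e.g. highlighting).

Thanks.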

score 0 · Accepted Answer

I use Poppler's command-line tool pdftohtml to extract rich text, but if you need clean paragraphs, the PDF has to be a tagged PDF. If you need the (x, y) coordinates of a paragraph, you will have to dig deeper into Poppler. The Apache PDFBox Java library can also be used. If you add an annotation at the beginning of each paragraph, you can pull the annotations out of the PDF as XML, where you will find the (x, y) coordinates of each annotation. Adobe puts a clever encryption into the PDF to make this data hard to discover, so it may not be easy (what with all the legal hassles, etc.) to pull it out without Adobe tools.
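
For an untagged PDF, one possible approach is a layout heuristic: collect each text line with its box, then start a new paragraph whenever the vertical gap to the previous line is noticeably larger than normal line spacing. The sketch below uses only the Java standard library. ParagraphGrouper, Line, Paragraph and the gapFactor threshold are hypothetical names chosen for illustration, not part of PDFBox. In PDFBox 2.x the line data could plausibly be gathered by subclassing PDFTextStripper and overriding writeString(String, List<TextPosition>), reading getXDirAdj(), getYDirAdj(), getWidthDirAdj() and getHeightDir() from the TextPosition objects; that wiring is assumed here and not shown.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper (not part of PDFBox): groups positioned text lines
// into paragraphs and keeps a bounding box for each paragraph.
class ParagraphGrouper {

    // One line of text with its box. Assumes a top-left origin, so y grows
    // downward and (x, y) is the top-left corner of the line.
    static final class Line {
        final String text;
        final float x, y, width, height;

        Line(String text, float x, float y, float width, float height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    // A paragraph: the joined text plus the union of its lines' boxes.
    static final class Paragraph {
        final StringBuilder text = new StringBuilder();
        float left, top, right, bottom;

        Paragraph(Line first) {
            left = first.x;
            top = first.y;
            right = first.x + first.width;
            bottom = first.y + first.height;
            text.append(first.text);
        }

        void add(Line line) {
            left = Math.min(left, line.x);
            top = Math.min(top, line.y);
            right = Math.max(right, line.x + line.width);
            bottom = Math.max(bottom, line.y + line.height);
            text.append(' ').append(line.text);
        }
    }

    // Starts a new paragraph when the vertical gap to the previous line
    // exceeds gapFactor * previous line height, or when y jumps backwards
    // (for example at a new column or page).
    static List<Paragraph> group(List<Line> lines, float gapFactor) {
        List<Paragraph> paragraphs = new ArrayList<>();
        Paragraph current = null;
        Line prev = null;
        for (Line line : lines) {
            boolean startNew = prev == null
                    || line.y < prev.y
                    || line.y - (prev.y + prev.height) > gapFactor * prev.height;
            if (startNew) {
                current = new Paragraph(line);
                paragraphs.add(current);
            } else {
                current.add(line);
            }
            prev = line;
        }
        return paragraphs;
    }

    public static void main(String[] args) {
        // Made-up sample lines: two close together, then one after a larger gap.
        List<Line> lines = Arrays.asList(
                new Line("First line", 72f, 100f, 200f, 10f),
                new Line("second line", 72f, 112f, 150f, 10f),
                new Line("New paragraph", 72f, 140f, 180f, 10f));
        for (Paragraph p : group(lines, 0.8f)) {
            System.out.println(p.text + " @ [" + p.left + ", " + p.top
                    + ", " + p.right + ", " + p.bottom + "]");
        }
    }
}
```

Each Paragraph then carries the rectangle (left, top, right, bottom) that could later be used for highlighting. The gap threshold is a guess that needs tuning per document, and the heuristic will misfire on multi-column layouts or paragraphs marked only by indentation.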
