java - How can we extract text content from a PDF file without the headers and footers

Question

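How can we extract the text content from a PDF file? We use pdfbox to extract text from PDF files, but we don't want the headers and footers. I am using the following Java code.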

// `document` is the already loaded PDDocument; `pageCount` is the 1-based page to extract.
try {
    // Create the stripper inside the try block so a failed construction
    // cannot lead to a NullPointerException further down.
    PDFTextStripper stripper = new PDFTextStripper();
    // Restrict extraction to a single page.
    stripper.setStartPage(pageCount);
    stripper.setEndPage(pageCount);
    String pageText = stripper.getText(document);
    System.out.println(pageText);
} catch (IOException e) {
    e.printStackTrace();
}

score 5 · Accepted Answer

You have tagged this as an itext/itextpdf question, yet you are using PdfBox. That's confusing.

You also claim that your PDF file has headers and footers. This would imply that your PDF is a Tagged PDF and that the header and the footer are marked as artifacts. If that is the case, then you should take advantage of the Tagged nature of the PDF, and extract the content as is done in the ParseTaggedPdf example:

TaggedPdfReaderTool readertool = new TaggedPdfReaderTool();
PdfReader reader = new PdfReader(StructuredContent.RESULT);
readertool.convertToXml(reader, new FileOutputStream(RESULT));
reader.close();

If this doesn't result in anything, you clearly don't have a Tagged PDF in which case there are no headers and footers in your document from a technical point of view. You may see headers and footers with your human eyes, but that doesn't mean that a machine sees these headers and footers. To a machine, it's just text like any other text in the page.

The ExtractPageContentArea example shows how we can define a rectangle that excludes the header and the footer when parsing for the content.

PdfReader reader = new PdfReader(pdf);
PrintWriter out = new PrintWriter(new FileOutputStream(txt));
Rectangle rect = new Rectangle(70, 80, 490, 580);
RenderFilter filter = new RegionTextRenderFilter(rect);
TextExtractionStrategy strategy;
for (int i = 1; i <= reader.getNumberOfPages(); i++) {
    strategy = new FilteredTextRenderListener(new LocationTextExtractionStrategy(), filter);
    out.println(PdfTextExtractor.getTextFromPage(reader, i, strategy));
}
out.flush();
out.close();
reader.close();

In this case, we have examined the document manually and we noticed that the actual text is always added inside the rectangle new Rectangle(70, 80, 490, 580). The header is added above Y coordinate 580 and the footer below Y coordinate 80. By using the RegionTextRenderFilter, we extract only the content that overlaps with the rectangle we have defined, which leaves out the header and the footer.
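
To make the keep/drop decision concrete, here is a minimal, library-free sketch of the idea. It is not iText's implementation: the Box and Chunk classes and the keepInside method are hypothetical names introduced only for illustration, and the chunk coordinates are made up. It assumes PDF user space, with the origin at the bottom-left and Y growing upward.

```java
import java.util.ArrayList;
import java.util.List;

/** Illustrative only: a simplified model of a region-based text filter. */
final class RegionFilterSketch {

    /** Axis-aligned box given by its lower-left and upper-right corners. */
    static final class Box {
        final float llx, lly, urx, ury;

        Box(float llx, float lly, float urx, float ury) {
            this.llx = llx;
            this.lly = lly;
            this.urx = urx;
            this.ury = ury;
        }

        /** True when the two boxes share some area (touching edges do not count). */
        boolean overlaps(Box other) {
            return llx < other.urx && other.llx < urx
                && lly < other.ury && other.lly < ury;
        }
    }

    /** A piece of text plus the box it occupies on the page (hypothetical type). */
    static final class Chunk {
        final String text;
        final Box box;

        Chunk(String text, Box box) {
            this.text = text;
            this.box = box;
        }
    }

    /** Returns the text of every chunk whose box overlaps the given region. */
    static List<String> keepInside(Box region, List<Chunk> chunks) {
        List<String> kept = new ArrayList<>();
        for (Chunk chunk : chunks) {
            if (region.overlaps(chunk.box)) {
                kept.add(chunk.text);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Same numbers as the rectangle used in the answer.
        Box body = new Box(70, 80, 490, 580);

        List<Chunk> page = new ArrayList<>();
        page.add(new Chunk("Chapter 1", new Box(70, 590, 200, 602)));      // header, above y = 580
        page.add(new Chunk("Body paragraph", new Box(70, 300, 480, 312))); // inside the region
        page.add(new Chunk("Page 3", new Box(280, 40, 320, 52)));          // footer, below y = 80

        System.out.println(keepInside(body, page)); // prints [Body paragraph]
    }
}
```

The rectangle only works if the header and footer sit at the same position on every page, so measure a few representative pages before settling on the coordinates.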
