pdf - Commenting on or highlighting a two-column PDF using pdf-clown

Question

I have searched Google, SO, and the PDF Clown and PDFBox forums for possible solutions, and I am now posting the question on SO.

Problem: I have been trying to find a way to highlight text that spans multiple lines in a PDF document. The PDF pages can have one or two columns.

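With pdf-clown I am able to highlight a phrase, provided that all of its words appear on the same line. PDFBox produced XML for individual words, but I could not find a solution for phrases or lines.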

Please suggest a solution for pdf-clown, if there is one, or any other Java-compatible tool that can highlight multi-line text in a PDF.

There is an answer to a similar question, but it is for iText and I could not understand it. Any help?: Multiline markup annotations with iText

score 0 · Accepted Answer

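Currently (PDF Clown 0.1.2), extraction of multi-column text is not supported: the current algorithm collects text lying on the same horizontal baseline, without evaluating possible gaps between columns.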

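Automatic multi-column layout detection is possible but somewhat tricky, since PDF is (as you know) essentially an unstructured graphics format. Nevertheless, I am considering some experiments on this, in order to handle at least the most common cases.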

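In the meantime, I can suggest an effective workaround (provided that you are dealing with documents whose columns are placed in predictable areas): perform a separate text extraction for each column, instructing the TextExtractor to look at the corresponding page area, then put all these partial extraction results together and apply your filter.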
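The reordering step of that workaround can be sketched without any PDF library. The following is only an illustration under stated assumptions: the `Word` type, the `readingOrder` helper, and the column boundaries `{0, 300, 600}` are hypothetical and are not part of PDF Clown or PDFBox. It also assumes top-down page coordinates and that words on the same line share exactly the same `y` value.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of PDF Clown or PDFBox.
final class ColumnReadingOrder {

    /** A word and its bounding box in top-down page coordinates (hypothetical type). */
    static final class Word {
        final String text;
        final double x, y, width, height;

        Word(String text, double x, double y, double width, double height) {
            this.text = text;
            this.x = x;
            this.y = y;
            this.width = width;
            this.height = height;
        }
    }

    /** Words whose left edge lies in [minX, maxX), sorted top to bottom, then left to right. */
    static List<Word> wordsInColumn(List<Word> words, double minX, double maxX) {
        List<Word> result = new ArrayList<>();
        for (Word w : words) {
            if (w.x >= minX && w.x < maxX) {
                result.add(w);
            }
        }
        result.sort(Comparator.comparingDouble((Word w) -> w.y).thenComparingDouble(w -> w.x));
        return result;
    }

    /** Concatenates the per-column results; columnBounds holds ascending x boundaries. */
    static List<Word> readingOrder(List<Word> words, double[] columnBounds) {
        List<Word> ordered = new ArrayList<>();
        for (int i = 0; i + 1 < columnBounds.length; i++) {
            ordered.addAll(wordsInColumn(words, columnBounds[i], columnBounds[i + 1]));
        }
        return ordered;
    }

    /** Joins the word texts with single spaces so a phrase filter can be applied. */
    static String joinText(List<Word> words) {
        StringBuilder sb = new StringBuilder();
        for (Word w : words) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(w.text);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        List<Word> page = new ArrayList<>();
        // "alpha" and "gamma" share a baseline but belong to different columns.
        page.add(new Word("gamma", 320, 100, 40, 10));
        page.add(new Word("alpha", 50, 100, 40, 10));
        page.add(new Word("beta", 50, 115, 30, 10));
        page.add(new Word("delta", 320, 115, 35, 10));

        // Assumed layout: two columns split at x = 300 on a 600-unit-wide page.
        System.out.println(joinText(readingOrder(page, new double[] {0, 300, 600})));
        // prints "alpha beta gamma delta"
    }
}
```

Because each column is ordered on its own before concatenation, a phrase that wraps inside one column stays contiguous in the joined text, which a purely baseline-based ordering would not guarantee.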

score 0 · Accepted Answer

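You can use PDFBox to get the coordinates of each piece of text in the PDF document. Here is the code for it (written against the PDFBox 1.x API):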

import java.io.File;
import java.io.IOException;
import java.util.List;

import org.apache.pdfbox.exceptions.InvalidPasswordException;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.common.PDStream;
import org.apache.pdfbox.util.PDFTextStripper;
import org.apache.pdfbox.util.TextPosition;

// Prints the position and size of every text position found on each page.
public class PrintTextLocations extends PDFTextStripper {

    public PrintTextLocations() throws IOException {
        super.setSortByPosition(true);
    }

    public static void main(String[] args) throws Exception {

        PDDocument document = null;
        try {
            File input = new File("C:\\path\\to\\PDF.pdf");
            document = PDDocument.load(input);
            if (document.isEncrypted()) {
                try {
                    document.decrypt("");
                } catch (InvalidPasswordException e) {
                    System.err.println("Error: Document is encrypted with a password.");
                    System.exit(1);
                }
            }
            PrintTextLocations printer = new PrintTextLocations();
            List allPages = document.getDocumentCatalog().getAllPages();
            for (int i = 0; i < allPages.size(); i++) {
                PDPage page = (PDPage) allPages.get(i);
                System.out.println("Processing page: " + i);
                PDStream contents = page.getContents();
                if (contents != null) {
                    printer.processStream(page, page.findResources(), page.getContents().getStream());
                }
            }
        } finally {
            if (document != null) {
                document.close();
            }
        }
    }

    // Called by PDFTextStripper for each text position it processes.
    @Override
    protected void processTextPosition(TextPosition text) {
        System.out.println("String[" + text.getXDirAdj() + ","
                + text.getYDirAdj() + " fs=" + text.getFontSize() + " xscale="
                + text.getXScale() + " height=" + text.getHeightDir() + " space="
                + text.getWidthOfSpace() + " width="
                + text.getWidthDirAdj() + "]" + text.getCharacter());
    }
}
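Once the coordinates of the text that makes up a matched phrase are known, a multi-line highlight needs one rectangle per line rather than one rectangle for the whole match. A minimal sketch of that grouping step follows, assuming the boxes arrive in reading order and use top-down coordinates; `LineHighlightBoxes`, `mergeByLine`, and the tolerance value are hypothetical names chosen for illustration, and turning the resulting rectangles into actual highlight annotations is left to whichever PDF library is used.

```java
import java.awt.geom.Rectangle2D;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of PDFBox or PDF Clown.
final class LineHighlightBoxes {

    /**
     * Merges boxes (given in reading order) into one rectangle per text line.
     * A box joins the current line when its top y-coordinate differs from the
     * line's top by at most yTolerance; otherwise it starts a new line.
     */
    static List<Rectangle2D> mergeByLine(List<Rectangle2D> boxes, double yTolerance) {
        List<Rectangle2D> lines = new ArrayList<>();
        Rectangle2D current = null;
        for (Rectangle2D box : boxes) {
            if (current != null && Math.abs(box.getY() - current.getY()) <= yTolerance) {
                current.add(box); // grow the current line rectangle in place
            } else {
                current = (Rectangle2D) box.clone(); // start a new line
                lines.add(current);
            }
        }
        return lines;
    }

    public static void main(String[] args) {
        List<Rectangle2D> match = new ArrayList<>();
        // A three-word phrase that wraps onto a second line after two words.
        match.add(new Rectangle2D.Double(50, 100, 30, 10));
        match.add(new Rectangle2D.Double(85, 100, 20, 10));
        match.add(new Rectangle2D.Double(50, 115, 25, 10));

        for (Rectangle2D line : mergeByLine(match, 2.0)) {
            System.out.println(line.getX() + "," + line.getY()
                    + " w=" + line.getWidth() + " h=" + line.getHeight());
        }
    }
}
```

The same grouping also covers a phrase that continues from the bottom of one column to the top of the next, since the jump in y starts a new rectangle.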
