java - Text boxes misplaced when reading doc and pdf files with Apache POI and Apache PDFBox

Question

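I am trying to read and process .doc, .docx and .pdf files in Java by converting each one to a single string, using the Apache POI (for doc, docx) and Apache PDFBox (for pdf) libraries.
It works fine until it encounters text boxes. If the document is laid out like this: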

Paragraph 1
Text box 1
Paragraph 2
Text box 2
Paragraph 3

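then the output should be:
Paragraph 1Text box 1Paragraph 2Text box 2Paragraph 3
But the output I get is:
Paragraph 1Paragraph 2Paragraph 3Text box 1Text box 2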

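The text boxes seem to be appended at the end instead of where they belong, between the paragraphs. The problem occurs with both doc and pdf files, which means both libraries, POI and PDFBox, show the same behavior.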

The code for reading pdf files is:

    void pdf(String file) throws IOException {
        // Initialize the file
        File myFile = new File(file);
        PDDocument pdDoc = null;
        try {
            // Load the PDF
            pdDoc = PDDocument.load(myFile);
            // Create the extractor
            PDFTextStripper pdf = new PDFTextStripper();
            // Extract the text
            output = pdf.getText(pdDoc);
        }
        finally {
            if (pdDoc != null)
                // Close the document
                pdDoc.close();
        }
    }

The code for doc files is:

    void doc(String file) throws FileNotFoundException, IOException {
        File myFile = null;
        WordExtractor extractor = null;
        // Initialize the file
        myFile = new File(file);
        // Create a file input stream
        FileInputStream fis = new FileInputStream(myFile.getAbsolutePath());
        // Open the document
        HWPFDocument document = new HWPFDocument(fis);
        // Create the extractor
        extractor = new WordExtractor(document);
        // Get the text from the document
        output = extractor.getText();
    }

score 3 · Accepted Answer


For PDFBox, do the following: pdf.setSortByPosition(true);
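To illustrate why sorting by position can help, here is a minimal sketch in plain Java. It does not use PDFBox and does not show how PDFBox works internally. The `Fragment` class, its coordinates and the sample data are hypothetical, chosen only to mimic the layout in the question, where the text boxes are stored after the body paragraphs. Ordering the fragments top to bottom, then left to right, restores the expected reading order.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class ReadingOrderDemo {

    /** Hypothetical stand-in for a piece of extracted text and its position on the page. */
    public static final class Fragment {
        final String text;
        final double top;   // distance from the top of the page
        final double left;  // distance from the left edge of the page

        public Fragment(String text, double top, double left) {
            this.text = text;
            this.top = top;
            this.left = left;
        }
    }

    /** Concatenates the fragments top to bottom, then left to right. */
    public static String inReadingOrder(List<Fragment> fragments) {
        List<Fragment> sorted = new ArrayList<>(fragments);
        sorted.sort(Comparator.comparingDouble((Fragment f) -> f.top)
                .thenComparingDouble(f -> f.left));
        StringBuilder sb = new StringBuilder();
        for (Fragment f : sorted) {
            sb.append(f.text);
        }
        return sb.toString();
    }

    /** Sample data in storage order: body paragraphs first, text boxes last (assumed). */
    public static List<Fragment> sampleInStorageOrder() {
        List<Fragment> list = new ArrayList<>();
        list.add(new Fragment("Paragraph 1", 100, 50));
        list.add(new Fragment("Paragraph 2", 300, 50));
        list.add(new Fragment("Paragraph 3", 500, 50));
        list.add(new Fragment("Text box 1", 200, 80));
        list.add(new Fragment("Text box 2", 400, 80));
        return list;
    }

    public static void main(String[] args) {
        System.out.println(inReadingOrder(sampleInStorageOrder()));
    }
}
```

Read in storage order, the sample data gives the output from the question, with both text boxes at the end. After sorting by position, the text boxes fall between the paragraphs.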

Answered 2012-10-06T01:58:29.390

score 0 · Accepted Answer

Try the following code for pdf files. You can try something similar for doc files.

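    // Imports for PDFBox 2.x
    import java.io.File;
    import java.io.IOException;
    import org.apache.pdfbox.pdmodel.PDDocument;
    import org.apache.pdfbox.pdmodel.encryption.InvalidPasswordException;
    import org.apache.pdfbox.text.PDFTextStripper;

    void extractPdfTexts(String file) {
        File myFile = new File(file);
        String output;
        // try-with-resources closes the document automatically
        try (PDDocument pdDocument = PDDocument.load(myFile)) {
            PDFTextStripper pdfTextStripper = new PDFTextStripper();
            // Emit text in page-position order rather than content-stream order
            pdfTextStripper.setSortByPosition(true);
            output = pdfTextStripper.getText(pdDocument);
            System.out.println(output);
        } catch (InvalidPasswordException e) {
            e.printStackTrace();
        } catch (IOException e) {
            e.printStackTrace();
        }
    }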
