java - How to remove unnecessary text when extracting from a PDF

Translated from: https://stackoverflow.com/questions/40179854 2016-10-21T14:57:49.073

Viewed 95 times

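I am using Apache PDFBox to extract text from scientific papers, and I can already extract the text from a PDF file.

Below is the code that extracts plain text from a PDF. An example of the data to extract: https://www.aclweb.org/anthology/P/P16/P16-2015.pdf.

I want to get only the title and the main text, without the references and without the author names on the first page (Yanhui Gu 1 Zhenglu Yang 2∗ .... -> {xingtian.shi }@sap.com).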

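    // Not used in this snippet.
    PDFTextStripper pdfStripper = null;
    PDDocument pdDoc = null;
    COSDocument cosDoc = null;

    int count = 1;
    String directory = "Result";
    File folder = new File("data");
    File[] listOfFiles = folder.listFiles();
    // listFiles() returns null if "data" is not a readable directory.
    if (listOfFiles != null) {
        for (File file : listOfFiles) {
            if (file.isFile()) {
                try {
                    // getText and printFile are helper methods not shown here.
                    String text = getText(file);
                    // Strip every newline, carriage return, and tab.
                    String t = text.replaceAll("[\\n\\r\\t]", "");
                    printFile(directory + File.separator + "data" + count + ".txt", t);
                    count++;
                } catch (IOException e) {
                    // Report the failure instead of silently swallowing it.
                    e.printStackTrace();
                }
            }
        }
    }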
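
One possible way to drop the author block and the reference list is to filter the extracted text line by line before the newlines are stripped. The sketch below is only a heuristic and uses plain Java, not any PDFBox API. The class `PaperTextFilter`, the method `keepTitleAndBody`, and the `titleLines` parameter are hypothetical names introduced for illustration. It assumes the title occupies the first `titleLines` non-blank lines, that the body begins at a line containing only "Abstract", and that the reference list begins at a line containing only "References" or "Bibliography" (optionally preceded by a section number). Papers with a different layout would need different rules.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

/** Heuristic filter for text extracted from a paper (hypothetical helper, not part of PDFBox). */
class PaperTextFilter {

    // A line holding only a heading such as "References" or "6 References" (assumed layout).
    private static final Pattern REFERENCES_HEADING = Pattern.compile(
            "\\s*(\\d+\\.?\\s+)?(References|Bibliography)\\s*", Pattern.CASE_INSENSITIVE);

    // A line holding only the word "Abstract" (assumed layout).
    private static final Pattern ABSTRACT_HEADING = Pattern.compile(
            "\\s*Abstract\\s*", Pattern.CASE_INSENSITIVE);

    /**
     * Keeps the title lines plus everything from the "Abstract" heading up to,
     * but not including, the last "References" heading.
     */
    static String keepTitleAndBody(String extracted, int titleLines) {
        String[] lines = extracted.split("\\r?\\n");

        // The title is assumed to start at the first non-blank line.
        int titleStart = 0;
        while (titleStart < lines.length && lines[titleStart].trim().isEmpty()) {
            titleStart++;
        }
        int titleEnd = Math.min(titleStart + Math.max(0, titleLines), lines.length);

        // The body starts at the first "Abstract" heading; without one, nothing is skipped.
        int bodyStart = titleEnd;
        for (int i = titleEnd; i < lines.length; i++) {
            if (ABSTRACT_HEADING.matcher(lines[i]).matches()) {
                bodyStart = i;
                break;
            }
        }

        // The body ends at the last "References" heading; without one, it runs to the end.
        int bodyEnd = lines.length;
        for (int i = lines.length - 1; i >= bodyStart; i--) {
            if (REFERENCES_HEADING.matcher(lines[i]).matches()) {
                bodyEnd = i;
                break;
            }
        }

        List<String> kept = new ArrayList<>();
        for (int i = titleStart; i < titleEnd; i++) {
            kept.add(lines[i]);
        }
        for (int i = bodyStart; i < bodyEnd; i++) {
            kept.add(lines[i]);
        }
        return String.join("\n", kept);
    }
}
```

In the loop above, this would be applied to the result of `getText(file)` before the `replaceAll` call, for example `PaperTextFilter.keepTitleAndBody(text, 1)`, because the heuristic depends on the line breaks still being present.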
