I am trying to search for a string in a PDF file using Aspose.PDF.

This is what I am doing:

String path = "C:/Windows/Fonts";
List list = Document.getLocalFontPaths();
list.add(path);
Document.setLocalFontPaths(list);
Document pdfDocument = new Document("myFile.pdf");
PageCollection pages = pdfDocument.getPages();
TextAbsorber textAbsorber = new TextAbsorber(
    new TextExtractionOptions(TextExtractionOptions.TextFormattingMode.Raw));

for (int i = 1; i <= pages.size(); i++) {
    Page currentPage = pdfDocument.getPages().get_Item(i);
    currentPage.accept(textAbsorber);
    String abText = textAbsorber.getText();
    String[] abArray = abText.trim().split("\n");
    for (String txtArray : abArray) {
        if (txtArray.contains("SomeText")) {
            // do something
        }
    }
}

A NullPointerException is thrown at: currentPage.accept(textAbsorber);

Error stack trace:

java.lang.NullPointerException
    at com.aspose.pdf.internal.p51.z11.m2(Unknown Source)
    at com.aspose.pdf.internal.p51.z11.m7(Unknown Source)
    at com.aspose.pdf.internal.p51.z13.m1(Unknown Source)
    at com.aspose.pdf.internal.p51.z13.m1(Unknown Source)
    at com.aspose.pdf.internal.p51.z13.m6(Unknown Source)
    at com.aspose.pdf.internal.p51.z13.<init>(Unknown Source)
    at com.aspose.pdf.internal.p51.z13.<init>(Unknown Source)
    at com.aspose.pdf.TextAbsorber.visit(Unknown Source)
    at com.aspose.pdf.Page.accept(Unknown Source)

What could be the cause?

1 Answer

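You do not need to split or trim strings in order to extract text from a PDF file. The Aspose.PDF API has built-in support for text extraction and search. Please try the following code snippet to extract text from a PDF document.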

// Open document
Document pdfDocument = new Document("input.pdf");

// Create a TextFragmentAbsorber to find all instances of the input search phrase
TextFragmentAbsorber textFragmentAbsorber = new TextFragmentAbsorber("SEARCH STRING");

// Accept the absorber for all pages of the document
pdfDocument.getPages().accept(textFragmentAbsorber);

// Get the extracted text fragments into collection
TextFragmentCollection textFragmentCollection = textFragmentAbsorber.getTextFragments();

// Loop through the Text fragments
for (TextFragment textFragment : (Iterable<TextFragment>) textFragmentCollection) {
    // Iterate through text segments
    for (TextSegment textSegment : (Iterable<TextSegment>) textFragment.getSegments()) {
        System.out.println("Text :- " + textSegment.getText());
    }
}
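
If you would still rather keep the line-by-line matching from the question, that step can be sketched in plain Java, separately from any Aspose call. This is only an illustration under assumptions: `LineSearch` and `findLines` are hypothetical names that are not part of Aspose.PDF, and `pageText` stands in for whatever string your absorber returned. It splits on `\R` (any line break) instead of `"\n"`, so text with `\r\n` line endings does not leave stray carriage returns behind.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper for illustration only; not part of the Aspose.PDF API.
class LineSearch {

    // Returns every line of pageText that contains needle, trimmed.
    // Returns an empty list for null or empty input instead of throwing.
    static List<String> findLines(String pageText, String needle) {
        List<String> hits = new ArrayList<>();
        if (pageText == null || needle == null || needle.isEmpty()) {
            return hits;
        }
        // "\\R" matches \n, \r\n and \r, unlike split("\n")
        for (String line : pageText.split("\\R")) {
            if (line.contains(needle)) {
                hits.add(line.trim());
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        // Stand-in for text extracted from a page (assumed sample data)
        String pageText = "Header\r\nSomeText on line two\nFooter";
        for (String hit : findLines(pageText, "SomeText")) {
            System.out.println(hit);
        }
    }
}
```

The null checks only protect this helper; they do not address the NullPointerException raised inside `accept` in the question.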

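For more information on text extraction, you can visit "Search and Get Text from Pages of a PDF Document". If you run into any problems, please share the source PDF file with us and mention the text you want to extract.

PS: I work with Aspose as a Developer Evangelist.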

Answered 2018-05-05T07:05:41.650