I am using iText v5.5.1 to read a PDF and extract the text from it:

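import java.io.InputStream;
import org.apache.commons.io.input.CloseShieldInputStream;
import com.itextpdf.text.pdf.PdfReader;
import com.itextpdf.text.pdf.parser.PdfReaderContentParser;
import com.itextpdf.text.pdf.parser.SimpleTextExtractionStrategy;

// "is" is the InputStream holding the PDF; the shield keeps PdfReader from closing it.
PdfReader pdfReader = new PdfReader(new CloseShieldInputStream(is));
PdfReaderContentParser pdfParser = new PdfReaderContentParser(pdfReader);

int maxPageNumber = pdfReader.getNumberOfPages();

StringBuilder sb = new StringBuilder();

// Page numbers are 1-based.
for (int pageNumber = 1; pageNumber <= maxPageNumber; pageNumber++) {
    // Use a fresh strategy per page: a reused SimpleTextExtractionStrategy
    // keeps accumulating text, so earlier pages would be appended again.
    SimpleTextExtractionStrategy extractionStrategy = new SimpleTextExtractionStrategy();
    pdfParser.processContent(pageNumber, extractionStrategy);

    sb.append(extractionStrategy.getText());
}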

For one PDF file, the following exception is thrown:

java.lang.ClassCastException: com.itextpdf.text.pdf.PdfNumber cannot be cast to com.itextpdf.text.pdf.PdfLiteral
    at com.itextpdf.text.pdf.parser.PdfContentStreamProcessor.processContent(PdfContentStreamProcessor.java:382)
    at com.itextpdf.text.pdf.parser.PdfReaderContentParser.processContent(PdfReaderContentParser.java:80)

The PDF file seems to be corrupted, but maybe its content still makes sense...

2 Answers

Indeed,

"The PDF file seems to be corrupted"

The content streams of all pages look like this:

/GS1 gs
q
595.00 0 0 

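They all appear to be cut off early, because the last line is not a complete operation. This can certainly make a parser like iText's choke.

Furthermore, the content should be longer, because even the compressed stream is a bit larger than the content shown above. This indicates that the streams are already broken at the byte level.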

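Looking at the bytes of the PDF file, one cannot help but notice that

  1. even inside binary streams, the codes 13 and 10 only ever occur together, and
  2. the cross-reference offset values are smaller than the actual positions.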
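The first observation can be checked mechanically. The following is a minimal sketch, not part of iText; the class name `LineEndingStats` and its methods are hypothetical. It counts CR LF pairs, lone CR bytes, and lone LF bytes in a file. A PDF with sizeable binary streams would normally be expected to contain at least a few lone 13 or 10 bytes, so finding none is a hint (not proof) of a text-mode conversion.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;

class LineEndingStats {
    final int crLfPairs; // occurrences of 13 immediately followed by 10
    final int loneCr;    // 13 not followed by 10
    final int loneLf;    // 10 not preceded by 13

    LineEndingStats(int crLfPairs, int loneCr, int loneLf) {
        this.crLfPairs = crLfPairs;
        this.loneCr = loneCr;
        this.loneLf = loneLf;
    }

    /** Scans the raw bytes once and classifies every CR and LF. */
    static LineEndingStats count(byte[] data) {
        int pairs = 0, cr = 0, lf = 0;
        for (int i = 0; i < data.length; i++) {
            if (data[i] == 13) {
                if (i + 1 < data.length && data[i + 1] == 10) {
                    pairs++;
                    i++; // skip the LF that belongs to this pair
                } else {
                    cr++;
                }
            } else if (data[i] == 10) {
                lf++;
            }
        }
        return new LineEndingStats(pairs, cr, lf);
    }

    /** Heuristic only: true if CR and LF never occur on their own. */
    boolean looksTextConverted() {
        return crLfPairs > 0 && loneCr == 0 && loneLf == 0;
    }

    public static void main(String[] args) throws IOException {
        // args[0] is the path of the PDF to inspect
        byte[] data = Files.readAllBytes(Paths.get(args[0]));
        LineEndingStats s = count(data);
        System.out.println("CR LF pairs: " + s.crLfPairs
                + ", lone CR: " + s.loneCr
                + ", lone LF: " + s.loneLf
                + ", looks text-converted: " + s.looksTextConverted());
    }
}
```

The second observation fits the same picture: every inserted byte shifts all following objects by one position, so offsets recorded before the conversion come out too small.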

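Thus, I assume that this PDF has been transmitted using a transport method that handled it as text data, in particular one that replaced any kind of assumed line break (CR, LF, or CR LF) with the CR LF that is now omnipresent in the file (CR = carriage return = 13; LF = line feed = 10). Such replacements automatically break any compressed data stream, such as the content streams in your file.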
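To illustrate why such a replacement is fatal for compressed streams, here is a small self-contained sketch using only `java.util.zip`. It does not touch the PDF in question. The class name `CrLfDamageDemo` is hypothetical, and the conversion rule in `toCrLf` is an assumption about what the transfer did. It deflates some bytes, rewrites every line-break byte in the compressed data as CR LF, and tries to inflate the result.

```java
import java.io.ByteArrayOutputStream;
import java.util.Arrays;
import java.util.Random;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class CrLfDamageDemo {
    /** Rewrites every CR, LF, or CR LF as CR LF (assumed text-mode behavior). */
    static byte[] toCrLf(byte[] in) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        for (int i = 0; i < in.length; i++) {
            byte b = in[i];
            if (b == 13 || b == 10) {
                // An existing CR LF pair is consumed as a single line break.
                if (b == 13 && i + 1 < in.length && in[i + 1] == 10) {
                    i++;
                }
                out.write(13);
                out.write(10);
            } else {
                out.write(b);
            }
        }
        return out.toByteArray();
    }

    /** Compresses data in zlib format, as used by PDF Flate streams. */
    static byte[] deflate(byte[] data) {
        Deflater deflater = new Deflater();
        deflater.setInput(data);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buf);
            out.write(buf, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Returns the inflated bytes, or null if the stream is invalid or incomplete. */
    static byte[] inflateOrNull(byte[] data) {
        Inflater inflater = new Inflater();
        inflater.setInput(data);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buf = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                if (n == 0 && !inflater.finished()
                        && (inflater.needsInput() || inflater.needsDictionary())) {
                    return null; // ran out of input before the stream ended
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            return null; // corrupted stream or checksum mismatch
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        byte[] original = new byte[4096];
        new Random(42).nextBytes(original); // stand-in for binary page content
        byte[] compressed = deflate(original);
        byte[] mangled = toCrLf(compressed);
        System.out.println("bytes added by conversion: "
                + (mangled.length - compressed.length));
        System.out.println("intact stream round-trips: "
                + Arrays.equals(inflateOrNull(compressed), original));
        System.out.println("mangled stream round-trips: "
                + Arrays.equals(inflateOrNull(mangled), original));
    }
}
```

The conversion is also not reversible: once a lone LF and a genuine CR LF both appear as CR LF, there is no way to tell which bytes were inserted, which is why such a file generally cannot be repaired after the fact.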

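Unfortunately, though...

"but maybe its content still makes sense"

Not much. Each page has one big image associated with it. Considering the small size of the content streams and the large size of the images, I would assume that the PDF only contains scanned pages. But the images, too, are broken by the replacements mentioned above.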

answered 2014-10-16T09:33:20.237

This isn't the best solution, but I had this exact problem and unfortunately can't share the exact PDFs I was having issues with.

I made a fork of itextpdf that catches the ClassCastException and simply skips the PdfObjects it takes issue with. It prints to System.out what the text contained and what type itextpdf thinks it was. I haven't been able to trace this to some systematic problem with my PDFs (someone smarter than me will need to do that), and the exception only happens once in a blue moon. Anyway, in case it helps anyone: this fork at least doesn't crash your code, lets you parse the majority of your PDFs, and gives you a bit of info on what kinds of byte strings seem to give itextpdf indigestion.

https://github.com/njhwang/itextpdf

answered 2015-04-27T19:40:43.883