itextsharp - Error when extracting text

Question

Unable to cast object of type 'iTextSharp.text.pdf.PdfLiteral' to type 'iTextSharp.text.pdf.PdfNumber'.

Code:

StringBuilder text = new StringBuilder();

SimpleTextExtractionStrategy strategy = new SimpleTextExtractionStrategy();

for (int p = 1; p <= reader.NumberOfPages; p++)
{

    text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, p, strategy));
}
reader.Close();
return text.ToString();

This error only occurs with a few PDFs. Any ideas?

Stack trace:

   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ShowTextArray.Invoke(PdfContentStreamProcessor processor, PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.InvokeOperator(PdfLiteral oper, List`1 operands)
   at iTextSharp.text.pdf.parser.PdfContentStreamProcessor.ProcessContent(Byte[] contentBytes, PdfDictionary resources)
   at iTextSharp.text.pdf.parser.PdfReaderContentParser.ProcessContent[E](Int32 pageNumber, E renderListener)
   at iTextSharp.text.pdf.parser.PdfTextExtractor.GetTextFromPage(PdfReader reader, Int32 pageNumber, ITextExtractionStrategy strategy)
   at DCS.Common.PDF.Functions.GetTextPdf(PdfReader reader) in C:\Users\rmaldonado\Documents\Visual Studio 2008\Projects\DCS\Contract\Common\PDF\Functions.cs:line 35
   at DCS.Common.PDF.Functions.ParsePDF(Byte[] bytes) in C:\Users\rmaldonado\Documents\Visual Studio 2008\Projects\DCS\Contract\Common\PDF\Functions.cs:line 23
   at DCS.CAPPS.BLL.Common.Attachment.ReParseText() in C:\Users\rmaldonado\Documents\Visual Studio 2008\Projects\DCS\Contract\ContractBLL\Common\Common.cs:line 1120

score 1 · Accepted Answer

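The page content of your document Mod 2.pdf is thoroughly broken. It is in fact so broken that Adobe Preflight (from Acrobat 9.5.4) fails with an error when trying to analyze it, just as iText does.

Manual inspection shows that the most obvious error concerns operations injected into the operand arrays of TJ operations, e.g.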

[(OMB) 0.0 Tc -278.0 (Approval) 0.0 Tc -278.0 (2700-0042) ] TJ

[(AMENDMENT) 0.0 Tc -278.0 (OF) 0.0 Tc -278.0 (SOLICITATION/MODIFICATION)
 0.0 Tc -278.0 (OF) 0.0 Tc -278.0 (CONTRACT) ] TJ

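This pattern continues throughout: every non-trivial [...] TJ operation contains injected 0.0 Tc operations.

This is an error, cf. section 7.8.2 of the PDF specification ISO 32000-1:2008:

In PDF, all of the operands needed by an operator shall immediately precede that operator. Operators do not return results, and operands shall not be left over when an operator finishes execution.

This makes PdfContentStreamProcessor.ShowTextArray.Invoke (which handles the TJ operation) fail. Since the operand array of TJ may contain only strings and numbers, everything that is not a PdfString is cast to PdfNumber, but the Tc operator is a PdfLiteral.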
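
If the broken files cannot be repaired at the source, one possible workaround is to catch the failed cast per page, so that one damaged content stream does not abort the whole extraction. The following is only a sketch under assumptions: the class and method names are hypothetical (they are not part of iTextSharp), and it assumes the failure surfaces as an InvalidCastException, as the message in the question suggests. It does not recover any text from the damaged pages.

```csharp
using System;
using System.Collections.Generic;
using System.Text;
using iTextSharp.text.pdf;
using iTextSharp.text.pdf.parser;

public static class TolerantTextExtraction
{
    // Hypothetical helper, not an iTextSharp API.
    // Extracts text page by page and records the pages whose content
    // stream could not be processed instead of letting the exception escape.
    public static string GetTextSkippingBrokenPages(PdfReader reader, IList<int> failedPages)
    {
        StringBuilder text = new StringBuilder();

        for (int p = 1; p <= reader.NumberOfPages; p++)
        {
            try
            {
                // A new strategy per page, so text collected for one page
                // is not carried over into the result for the next page.
                ITextExtractionStrategy strategy = new SimpleTextExtractionStrategy();
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, p, strategy));
            }
            catch (InvalidCastException)
            {
                // Assumption: a malformed TJ operand array (e.g. an injected
                // Tc operator) shows up here as the failed PdfLiteral -> PdfNumber cast.
                failedPages.Add(p);
            }
        }

        return text.ToString();
    }
}
```

The caller can then inspect failedPages to decide whether to log the document, reject it, or try another tool on those pages.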

score 1 · Accepted Answer

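As @mkl said, the error may well be in the PDF itself. Try copying the text content from the PDF and pasting it into Notepad. Is it blank? This simply checks whether the content is stored as an image or in some other format. Also, please post the complete code if possible.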

score 1 · Accepted Answer

To extract text from a PDF, try using the code given below

PdfTextExtractor.GetTextFromPage(reader, p, new LocationTextExtractionStrategy())
