.net - Parsing a PDF document with iTextSharp - flattened form field values are missing

Question

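I create a PDF file by taking a template and filling in its form fields. I then flatten the PDF to prevent changes to it. I now need to parse the PDF and get the data from the form fields; however, when I parse the PDF, the text of the form fields is missing. It seems I cannot reference the fields because the PDF is flattened, and parsing the PDF skips the places where the field text is and returns

First name: Last name:

But the PDF actually contains

First name: Jane Last name: Doe

How do I get the text where the form fields used to be?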

Update

' Requires Imports System.IO, System.Text, iTextSharp.text.pdf and iTextSharp.text.pdf.parser
Dim text As StringBuilder = New StringBuilder()

If File.Exists(filename) Then
    Dim pdfReader As New PdfReader(filename)

    ' Extract the text of each page in turn
    For page As Integer = 1 To pdfReader.NumberOfPages
        Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
        Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)

        currentText = Encoding.UTF8.GetString(ASCIIEncoding.Convert(Encoding.Default, Encoding.UTF8, Encoding.Default.GetBytes(currentText)))
        text.Append(currentText)
    Next

    pdfReader.Close()

    textBox1.Text = text.ToString()
    textBox1.SelectionStart = 0
End If

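I cannot post the original file because of the information in it, but I can post two sample files that illustrate what I am doing.

I am using a template PDF like this one... fw4.pdf

I then fill it with data and flatten it, so it looks like this... final_fw4.pdf

When I parse it with the code above, I get this... parsed_pdf_text.txt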

There is no data in the parsed text!

score 1 · Accepted Answer

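This is speculation, since I have no file to look at.

If by flattening you mean "putting the form data into the content", then the data may well be gone in any easily accessible form. Form data on a page is represented by widget annotations. To flatten a form, you take the appearance of a given widget annotation instance (or create one), append PDF code to the page content stream that renders the form field, and finally remove the annotation.

Here is what I see in your file: the first page has several content streams. The last content stream contains the following excerpt: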

Q q Q q 1 0 0 1 501.46 481.92 cm /Xi0 Do Q q Q q 1 0 0 1 500.87 457.9 cm /Xi1 Do Q q Q

which is (more or less):

grestore
gsave
grestore
gsave
    translate(501.46, 481.92)
    XObject("Xi0")
grestore
gsave
grestore
gsave
    translate(500.87, 457.9)
    XObject("Xi1")
grestore
gsave
grestore

Xi0 is object #1 in the file, a Form XObject with the following content stream:

q Q /Tx BMC q 0 0 26.03 12.33 re W n q BT 1 0 0 1 8.01 2.93 Tm /HeBo 9 Tf
1 0.59 0 0.11 k (Ja)Tj 0 g ET Q Q EMC

which is (more or less):

gsave
grestore
BeginMarkedContent("Tx")
gsave
    AddRectangle(0, 0, 26.03, 12.33)
    clip
    newpath
    gsave
        begintext
          TextTranslate(8.01, 2.93)
          SetFont("Helvetica-Bold", 9)
          SetCMYKColor(1, .59, 0, .11)
          DrawText("Ja")
          SetGray(0)
        endtext
    grestore
grestore
EndMarkedContent()

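Your text is right there, exactly as I speculated. The more interesting question is, "why don't I see it when I extract text with iTextSharp?" I am not entirely sure, since I have not worked on iTextSharp, but I did work on Adobe Acrobat and, among other things, on the text extraction engine used for search in Acrobat 1.0, so I know how challenging it is to extract text from a PDF, and that most products do it wrong, or poorly, or both. Most likely, iTextSharp walks the content stream and aggregates operations and state on any text operators (i.e., "place text in this font, this color and this rendering mode"), but it quite possibly does not recurse into XObjects, so it completely misses everything created by flattening the form.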
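For readers who want to repeat this inspection themselves, the following is a minimal sketch, assuming iTextSharp 5.x. It prints the decoded content stream of each Form XObject referenced from a page's resources. The module and method names (XObjectDump, DumpFormXObjects) are hypothetical, and the sketch assumes the page dictionary exposes its Resources entry directly and that the stream bytes are readable as ASCII.

```vbnet
Imports System
Imports System.Text
Imports iTextSharp.text.pdf

Module XObjectDump
    ' Hypothetical helper: prints the decoded content stream of every
    ' Form XObject listed in the resources of the given page.
    Sub DumpFormXObjects(filename As String, pageNumber As Integer)
        Dim reader As New PdfReader(filename)
        Try
            Dim pageDict As PdfDictionary = reader.GetPageN(pageNumber)
            Dim resources As PdfDictionary = pageDict.GetAsDict(PdfName.RESOURCES)
            If resources Is Nothing Then Return

            Dim xobjects As PdfDictionary = resources.GetAsDict(PdfName.XOBJECT)
            If xobjects Is Nothing Then Return

            For Each name As PdfName In xobjects.Keys
                ' Streams loaded through PdfReader are PRStream instances
                Dim stream As PRStream = TryCast(xobjects.GetAsStream(name), PRStream)
                If stream Is Nothing Then Continue For

                ' Skip image XObjects; only Form XObjects carry content operators
                If Not PdfName.FORM.Equals(stream.GetAsName(PdfName.SUBTYPE)) Then Continue For

                Dim bytes As Byte() = PdfReader.GetStreamBytes(stream)
                Console.WriteLine("{0}:", name)
                Console.WriteLine(Encoding.ASCII.GetString(bytes))
            Next
        Finally
            reader.Close()
        End Try
    End Sub
End Module
```

Calling DumpFormXObjects("final_fw4.pdf", 1) should list entries such as /Xi0 together with their operators, which makes it easy to confirm whether the flattened field text is present in the file.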

The short answer is that this is quite possibly a bug in iTextSharp, and it is worth reporting to them.

Normally, I would point you to my company's tools for doing this, but at the moment I do not have the "flattening" feature you want. Yet.

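If I were you, I would take the approach of writing the flattening code yourself. Effectively, you would iterate over the widget annotations and, instead of writing their appearance streams into the page content, write actual PDF content.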

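Also, speaking as a PDF aficionado, this PDF output could be better. There is no excuse for the empty, redundant gsave/grestore pairs, nor should there be no-op color changes. Fortunately, these are all benign.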

score 1 · Accepted Answer

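Your analysis of the problem is incorrect:

however, when I parse the PDF, the text of the form fields is missing

No, it is not missing. It is merely not where you expect it. If you search parsed_pdf_text.txt for "Ja", you will find all the flattened entries in one block: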

Ja
Ja
Ja
8
0
1
16
28
Jane Doe 532 12 1234
100 North Cujo Street
Nome, AK  67201
4 4 9
10
11
Walmart, Nome, AK
WAL666 AB 4321

As already indicated in a comment on your question, the reason is that you use the SimpleTextExtractionStrategy:

Dim strategy As ITextExtractionStrategy = New SimpleTextExtractionStrategy()
Dim currentText As String = PdfTextExtractor.GetTextFromPage(pdfReader, page, strategy)

Look at its class comment:

 * This renderer keeps track of the current Y position of each string.  If it detects
 * that the y position has changed, it inserts a line break into the output.  If the
 * PDF renders text in a non-top-to-bottom fashion, this will result in the text not
 * being a true representation of how it appears in the PDF.
 * 
 * This renderer also uses a simple strategy based on the font metrics to determine if
 * a blank space should be inserted into the output.

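Form information flattened into the content is added at the end of the content stream, and therefore its text appears at the end of the page text.

You might want to use the LocationTextExtractionStrategy instead. Its class comment says: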

 * A text extraction renderer that keeps track of relative position of text on page
 * The resultant text will be relatively consistent with the physical layout that most
 * PDF files have on screen.
 * <br>
 * This renderer keeps track of the orientation and distance (both perpendicular
 * and parallel) to the unit vector of the orientation.  Text is ordered by
 * orientation, then perpendicular, then parallel distance.  Text with the same
 * perpendicular distance, but different parallel distance is treated as being on
 * the same line.
 * <br>
 * This renderer also uses a simple strategy based on the font metrics to determine if
 * a blank space should be inserted into the output.

This still is not optimal, but it is probably better in your case.
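As a minimal sketch of that change, assuming iTextSharp 5.x, the only difference from the code in the question is the strategy class. The module and function names (LocationExtraction, ExtractByLocation) are hypothetical.

```vbnet
Imports System.Text
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module LocationExtraction
    ' Hypothetical helper: extracts the text of all pages, ordered by
    ' position on the page rather than by order in the content stream.
    Function ExtractByLocation(filename As String) As String
        Dim text As New StringBuilder()
        Dim reader As New PdfReader(filename)
        Try
            For page As Integer = 1 To reader.NumberOfPages
                ' Use a fresh strategy per page so text chunks do not accumulate
                Dim strategy As ITextExtractionStrategy = New LocationTextExtractionStrategy()
                text.AppendLine(PdfTextExtractor.GetTextFromPage(reader, page, strategy))
            Next
        Finally
            reader.Close()
        End Try
        Return text.ToString()
    End Function
End Module
```

With this strategy the flattened values should appear near their labels instead of in one block at the end, although how cleanly they line up depends on the baselines used in the form.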

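I now need to parse the PDF and get the data from the form fields

If you only have a limited number of forms, you could look up the positions of the original form fields and parse only the text at those positions. In this case, using a FilteredRenderListener in combination with a RegionTextRenderFilter might be of interest.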
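A minimal sketch of this region-based approach follows, assuming iTextSharp 5.x. It uses FilteredTextRenderListener, the text-extraction variant of the filtered listener, wrapped around a LocationTextExtractionStrategy. The module and function names (RegionExtraction, ExtractRegion) are hypothetical, and the rectangle in the usage note is only an illustration derived from the translate and clip values quoted in the first answer.

```vbnet
Imports iTextSharp.text.pdf
Imports iTextSharp.text.pdf.parser

Module RegionExtraction
    ' Hypothetical helper: returns only the text that the filter accepts
    ' as lying inside the given rectangle (in default user space units).
    Function ExtractRegion(filename As String,
                           pageNumber As Integer,
                           region As iTextSharp.text.Rectangle) As String
        Dim reader As New PdfReader(filename)
        Try
            Dim filter As New RegionTextRenderFilter(region)
            ' The listener forwards only text accepted by the filter
            Dim strategy As ITextExtractionStrategy =
                New FilteredTextRenderListener(New LocationTextExtractionStrategy(), filter)
            Return PdfTextExtractor.GetTextFromPage(reader, pageNumber, strategy)
        Finally
            reader.Close()
        End Try
    End Function
End Module
```

For example, the field drawn by Xi0 is placed at (501.46, 481.92) and clipped to 26.03 by 12.33 units, so ExtractRegion("final_fw4.pdf", 1, New iTextSharp.text.Rectangle(501.46F, 481.92F, 527.49F, 494.25F)) would be a plausible call for that field, assuming the page is not rotated. In practice the rectangles would more reliably come from the field positions in the unflattened template.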

score 1 · Accepted Answer

The text could be set on page load by a JavaScript action. In any case, I would love to see a file.
