python - pdfminer does not extract data from a filled-in PDF form

Question

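I am trying to use pdfminer to extract the content filled into a PDF form. To get the PDF:

Go to https://www.ffiec.gov/nicpubweb/nicweb/InstitutionProfile.aspx?parID_Rssd=1073757&parDT_END=99991231
Click "Create Report" next to the fourth report from the top (i.e., the Banking Organization Systemic Risk Report (FR Y-15))
Click "Your request for a financial report is ready"

To extract the content shown in blue, I copied the code from this post: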

from pdfminer.pdfparser import PDFParser
from pdfminer.pdfdocument import PDFDocument
from pdfminer.pdftypes import resolve1

filename = 'FRY15_1073757_20160630.PDF'
fp = open(filename, 'rb')
parser = PDFParser(fp)
doc = PDFDocument(parser)
fields = resolve1(doc.catalog['AcroForm'])['Fields']

for i in fields:
    field = resolve1(i)
    name, value = field.get('T'), field.get('V')
    print('{0}: {1}'.format(name, value))

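This does not extract the data fields as expected: nothing is printed. I tried the same code on another PDF and it worked, so I suspect the failure may be related to the first PDF's security settings (shown in a screenshot in the original post).

For the second PDF, where the code works, the security settings show "Allowed" for every operation. I also tried pdfminer's pdf2txt.py tool (see here), but the data filled into the form fields of the original PDF (which is what I want) is not in the converted text file; only the "flat", non-fillable parts of the PDF were converted. Interestingly, if I convert the PDF to a text file with Adobe Reader's Save As Text, the fillable parts do appear in the converted text file. That is the workaround I have been using in place of the failing code.

Any idea how to extract the data directly from the PDF form? Thanks.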

score 0 · Accepted Answer

I can only explain what the problem is but cannot present a solution because I have no working Python knowledge.

Your code iterates over the immediate children of the AcroForm Fields array and expects them to represent the form fields.

While this expectation is often fulfilled, it actually only covers a special case: form fields are arranged as a tree structure with that Fields array as the root element, and in the case of your sample document there is a large tree.

Thus, you have to descend into the structure, not merely iterate over the immediate children of Fields, to find all form fields.
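
A minimal Python sketch of that descent might look like the following. It is not taken from the answer above. The helper names walk_fields, to_text and extract_form_fields are hypothetical, and the sketch assumes that any kid carrying a /T entry is a child field while kids without /T are widget annotations. The byte decoding is only an approximation of the PDF text string encodings, and inherited /V values are not handled.

```python
def to_text(value):
    """Decode a PDF string to str; non-bytes values are returned unchanged."""
    if isinstance(value, bytes):
        if value.startswith(b"\xfe\xff"):  # UTF-16BE with byte order mark
            return value[2:].decode("utf-16-be", "replace")
        return value.decode("latin-1")  # rough stand-in for PDFDocEncoding
    return value


def walk_fields(fields, resolve, prefix=""):
    """Yield (qualified_name, value) for every terminal field below `fields`.

    `resolve` turns indirect references into objects (resolve1 in pdfminer).
    """
    for ref in resolve(fields) or []:
        field = resolve(ref)
        partial = to_text(field.get("T"))
        name = ".".join(part for part in (prefix, partial) if part)
        kids = [resolve(kid) for kid in (resolve(field.get("Kids")) or [])]
        # Assumption: kids that have /T are fields, the rest are widgets.
        child_fields = [kid for kid in kids if "T" in kid]
        if child_fields:
            # Intermediate node: descend instead of reading its value.
            yield from walk_fields(child_fields, resolve, name)
        else:
            yield name, to_text(resolve(field.get("V")))


def extract_form_fields(path):
    """Return all (name, value) pairs of the AcroForm in the PDF at `path`."""
    # Imported here so walk_fields can be tried without pdfminer installed.
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdftypes import resolve1

    with open(path, "rb") as fp:
        doc = PDFDocument(PDFParser(fp))
        acroform = resolve1(doc.catalog["AcroForm"])
        # Build the list while the file is still open.
        return list(walk_fields(acroform["Fields"], resolve1))
```

With the file from the question, usage would presumably be a loop such as `for name, value in extract_form_fields('FRY15_1073757_20160630.PDF'): print(name, value)`. Whether the values then show up for that particular document has not been verified here.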
