python - pypdf does not extract tables from a PDF

Question

I am using pypdf to extract text from a PDF file. The problem is that the tables in the PDF are not extracted. I have also tried pdfminer, but I ran into the same problem.

score 5 · Accepted Answer

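The problem is that tables in a PDF are usually made up of absolutely positioned lines and characters, and converting these into a sensible table representation is non-trivial.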

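In Python, PDFMiner is probably your best bet. It gives you a tree structure of layout objects, but you will have to do the table interpretation yourself by looking at the positions of the lines (LTLine) and text boxes (LTTextBox). PDFMiner has some documentation on this.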
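As a rough illustration of interpreting positions yourself, here is a minimal sketch that groups text boxes into rows by their vertical coordinate. It assumes the pdfminer.six fork (for `extract_pages`), and the helper names `group_into_rows` and `rows_per_page` as well as the `y_tolerance` value are made up for this example. It ignores LTLine rulings entirely, so treat it as a starting point rather than a real table detector.

```python
def group_into_rows(boxes, y_tolerance=3.0):
    """Group (x0, y0, text) tuples into rows of cell texts.

    Boxes whose bottom edge lies within y_tolerance of the first box
    in a row are treated as part of that row. Rows are returned top
    to bottom, cells left to right.
    """
    rows = []  # each entry: (row_y, [(x0, text), ...])
    # PDF coordinates start at the bottom-left, so a larger y is higher up.
    for x0, y0, text in sorted(boxes, key=lambda b: -b[1]):
        if rows and abs(rows[-1][0] - y0) <= y_tolerance:
            rows[-1][1].append((x0, text))
        else:
            rows.append((y0, [(x0, text)]))
    return [[text for _, text in sorted(cells)] for _, cells in rows]


def rows_per_page(pdf_path):
    """Yield one list of rows per page (hypothetical helper)."""
    # Imported lazily so group_into_rows works without pdfminer.six installed.
    from pdfminer.high_level import extract_pages
    from pdfminer.layout import LTTextBox

    for page in extract_pages(pdf_path):
        boxes = [
            (el.x0, el.y0, el.get_text().strip())
            for el in page
            if isinstance(el, LTTextBox)
        ]
        yield group_into_rows(boxes)
```

Calling `rows_per_page('example.pdf')` would give, for each page, a list of rows built from whatever text boxes the layout analysis found. PDFMiner may merge several cells into one text box, so in practice you may need to work with individual text lines or use the LTLine positions to find the cell borders.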

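Alternatively, PDFX attempts to do this (and often succeeds), but you have to use it as a web service (not ideal, but fine for an occasional job). To do this from Python, you could do something like the following: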

import urllib.request  # Python 3 replacement for urllib2
import xml.etree.ElementTree as ET

# Send the PDF to the PDFX web service
with open('example.pdf', 'rb') as f:
    pdfdata = f.read()
request = urllib.request.Request('http://pdfx.cs.man.ac.uk', data=pdfdata, headers={'Content-Type': 'application/pdf'})
response = urllib.request.urlopen(request).read()

# Parse the XML response and pull out each table region
tree = ET.fromstring(response)
for tbox in tree.findall('.//region[@class="DoCO:TableBox"]'):
    # Each value is the serialized XML (bytes) of the matching element
    src = ET.tostring(tbox.find('content/table'))
    info = ET.tostring(tbox.find('region[@class="TableInfo"]'))
    caption = ET.tostring(tbox.find('caption'))
