pdf - Extracting table data from a PDF

Question

Is there any consistent way to extract tables from a PDF file? Are there any tools for this?

What I have done so far:

I have tried the pdftotext tool. It has an option to convert to HTML layout.

What is wrong with this:

The table information is not preserved in the HTML output.
I was expecting <table> tags, but everything ends up under <p> tags.

Does a PDF document contain any markup that indicates table structure, like <table>, <tr>, and <td> in HTML?

If "yes", any pointers to this would be helpful. If "no", a definitive statement of that fact would also help.

score 19 · Accepted Answer

What you could do, however, is run pdftotext -layout input.pdf output.txt. It writes the PDF's text to a plain text file and preserves the original layout. There are no tags, but with a bit of nifty scripting (perl / php / whatever), you can recover the data from the tables.
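As a rough illustration of that kind of scripting, here is a minimal Python sketch. It assumes that columns in the -layout output are separated by two or more spaces and that single spaces only occur inside a cell; that holds for many tables but certainly not all. The function names and the sample text are made up for the example.

```python
import re


def split_layout_line(line):
    """Split one line of `pdftotext -layout` output into cells.

    Assumption: columns are separated by runs of two or more whitespace
    characters, while a single space only appears inside a cell.
    """
    return [cell for cell in re.split(r"\s{2,}", line.strip()) if cell]


def parse_layout_text(text):
    """Return a list of rows (each a list of cells), skipping blank lines."""
    rows = []
    for line in text.splitlines():
        cells = split_layout_line(line)
        if cells:
            rows.append(cells)
    return rows


# Hypothetical sample standing in for the contents of output.txt.
sample = (
    "Item            Qty     Unit price\n"
    "Blue widget       2          10.50\n"
    "\n"
    "Red gadget       11           3.00\n"
)
rows = parse_layout_text(sample)
for row in rows:
    print(row)
```

Splitting on whitespace runs is the simplest thing that can work. It breaks down when a cell is empty, because the columns to its right then shift left, so treat it as a starting point only.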

If you're working on a single page, you're probably better off doing it manually, but if you (like me) have to work on hundreds or thousands of pages, it's about the best you can get. I've been looking around for a long time and can't find any better PDF-to-text tool than pdftotext.

There is a bit of inconsistency in the output: not all similar PDF tables produce similar-looking text output. But that just makes your scripting a little more interesting.
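One way to cope with some of that inconsistency, sketched here in Python, is to work from character positions instead of whitespace runs: any position that is blank in every line of a table is treated as a gap between columns. This is only a heuristic under that assumption, and the names find_column_spans, slice_rows, and min_gap are invented for the example. The sample lines are generated with fixed-width formatting to stand in for real -layout output.

```python
def find_column_spans(lines, min_gap=2):
    """Return (start, end) character spans that hold text in some line.

    Assumption: a position that is blank in every line lies between two
    columns. Gaps narrower than `min_gap` are treated as spaces inside
    a cell, so the spans on either side are merged.
    """
    if not lines:
        return []
    width = max(len(line) for line in lines)
    padded = [line.ljust(width) for line in lines]
    used = [any(row[i] != " " for row in padded) for i in range(width)]

    spans, start = [], None
    for i, flag in enumerate(used):
        if flag and start is None:
            start = i
        elif not flag and start is not None:
            spans.append((start, i))
            start = None
    if start is not None:
        spans.append((start, width))

    merged = []
    for span_start, span_end in spans:
        if merged and span_start - merged[-1][1] < min_gap:
            merged[-1] = (merged[-1][0], span_end)
        else:
            merged.append((span_start, span_end))
    return merged


def slice_rows(lines, spans):
    """Cut every line at the given spans; empty cells become ''."""
    return [[line[a:b].strip() for a, b in spans] for line in lines]


# Hypothetical table laid out in fixed-width columns, as -layout might emit.
records = [
    ("Name", "Qty", "Note"),
    ("Widget", "2", "fragile"),
    ("Gadget", "", "spare"),
    ("Gizmo", "15", "on order"),
]
lines = [f"{name:<10}{qty:>3}   {note}" for name, qty, note in records]

spans = find_column_spans(lines)
table = slice_rows(lines, spans)
for row in table:
    print(row)
```

Unlike splitting on runs of spaces, this keeps the empty Qty cell in the "Gadget" row in its own column. It still needs the lines of one table to be grouped together first, and it fails when text in one row spills across a column gap.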

score 13 · Accepted Answer

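If the PDF document lacks the information that marks content as tables, rows, cells, and so on (known as tags), there is no consistent way to extract tables from it. Most PDF documents do not contain these tags. The tags are typically used to make a PDF accessible, so that it can be read aloud, for example. A PDF does not need these tags to be valid.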

