python - Extracting data from PDF tables into a structured format

Question

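I want to scrape table data from a PDF into any structured format (such as HTML, XML, or JSON). I am using Python. I first converted the PDF to text with the pdftotext command-line tool, but in the resulting text I cannot tell which data belongs to the tables in the PDF.

The PDF looks like the image below: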

score 0 · Accepted Answer

You can use Camelot to extract tabular data from a PDF and export it to CSV, Excel, JSON, or HTML. The documentation is at http://camelot-py.readthedocs.io. It would help if you could post a link to your PDF. Here is a generic code example:

>>> import camelot
>>> tables = camelot.read_pdf('file.pdf')
>>> type(tables[0].df)
<class 'pandas.core.frame.DataFrame'>
>>> tables.export('file.csv', f='csv')
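
Since the question asks for JSON specifically, here is a minimal sketch of turning extracted rows into JSON records using only the standard library. It assumes the first extracted row holds the column headers, which may not be true for every PDF. The helper name `rows_to_records` and the sample rows are hypothetical, made up for illustration; with Camelot, the rows could plausibly come from `tables[0].df.values.tolist()`.

```python
import json


def rows_to_records(rows):
    """Turn [header, row1, row2, ...] into a list of dicts keyed by header.

    Assumption: the first row contains the column names.
    """
    if not rows:
        return []
    header = [str(cell).strip() for cell in rows[0]]
    return [dict(zip(header, row)) for row in rows[1:]]


# Hypothetical sample data standing in for one extracted table.
sample_rows = [
    ["Name", "Score"],
    ["Alice", "90"],
    ["Bob", "85"],
]

records = rows_to_records(sample_rows)
print(json.dumps(records, indent=2))
```

Cell values stay as strings here; converting them to numbers is left to the caller, because text extracted from a PDF often needs cleanup first.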

Disclaimer: I am the author of the library.
