python - How do I extract tables from a PDF using camelot?

Question

I want to extract all the tables from a PDF using camelot in Python 3.

import camelot
# PDF file to extract tables from
file = "./pdf_file/ooo.pdf"
tables = camelot.read_pdf(file)
# number of tables extracted
print("Total tables extracted:", tables.n)
# print the first table as Pandas DataFrame
print(tables[0].df)
# export individually
tables[0].to_csv("./pdf_file/ooo.csv")

With this I only get a single table, from the first page of the PDF. How do I extract all the tables in the PDF file?

score 1 · Accepted Answer

tables = camelot.read_pdf(file, pages='1-end')

If the pages parameter is not specified, Camelot only analyzes the first page. For a fuller explanation, see the official documentation.
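Once every page is being read, each table usually needs its own output file. The following is a minimal sketch of that step, not part of the original answer: the helper names csv_path_for, export_tables and extract_all and the file-name pattern are made up for illustration, and it assumes each extracted table exposes page, order and to_csv, as Camelot's table objects do.

```python
from pathlib import Path


def csv_path_for(table, out_dir, stem):
    """Build a per-table CSV path from the table's page and its order on that page."""
    return Path(out_dir) / f"{stem}-page-{table.page}-table-{table.order}.csv"


def export_tables(tables, out_dir, stem):
    """Write every table to its own CSV file and return the paths written."""
    Path(out_dir).mkdir(parents=True, exist_ok=True)
    written = []
    for table in tables:
        path = csv_path_for(table, out_dir, stem)
        table.to_csv(str(path))
        written.append(path)
    return written


def extract_all(pdf_path, out_dir):
    """Read tables from every page of pdf_path and export them as CSV files."""
    import camelot  # imported lazily so the helpers above work without camelot

    # pages="all" covers the whole document, like pages="1-end"
    tables = camelot.read_pdf(pdf_path, pages="all")
    return export_tables(tables, out_dir, Path(pdf_path).stem)
```

With the question's file this would be called as extract_all("./pdf_file/ooo.pdf", "./pdf_file"). Camelot's result object also has its own export method, which may be enough if the default file naming is acceptable.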

score 0 · Accepted Answer

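To extract PDF tables with camelot, use the code below. Pass flavor="stream": it detects tables from the whitespace between cells, so it picks up most PDF tables, including those without ruling lines. If the extraction still has problems, add the row_tol and edge_tol arguments, for example row_tol=0 and edge_tol=500.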

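import camelot

file_path = "./pdf_file/ooo.pdf"  # path taken from the question
pdf_archive = camelot.read_pdf(file_path, pages="all", flavor="stream")

# Iterating the result yields one table at a time (not one page)
for pdf_table in pdf_archive:
    print(pdf_table.df)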
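Camelot's default flavor is "lattice", which relies on ruling lines, while "stream" works from whitespace. One possible way to combine the two, sketched here under assumptions rather than taken from the answer, is to try lattice first and fall back to stream only when nothing is found. The function name read_with_fallback and its read_pdf parameter are hypothetical; the parameter exists only so the logic can be exercised without a real PDF.

```python
def read_with_fallback(pdf_path, pages="all", read_pdf=None, **stream_kwargs):
    """Try the lattice flavor first; fall back to stream if no table is found.

    Extra keyword arguments (for example row_tol, edge_tol) are passed to the
    stream attempt only. Returns a (tables, flavor_used) pair.
    """
    if read_pdf is None:
        import camelot  # imported lazily; only needed for real PDFs

        read_pdf = camelot.read_pdf

    tables = read_pdf(pdf_path, pages=pages, flavor="lattice")
    if len(tables) > 0:
        return tables, "lattice"

    tables = read_pdf(pdf_path, pages=pages, flavor="stream", **stream_kwargs)
    return tables, "stream"
```

With the values suggested in the answer, the call would look like read_with_fallback(file_path, row_tol=0, edge_tol=500). Whether those tolerances suit a given PDF is something to check against the extracted DataFrames.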
