python - How do I find out which page a table extracted with tabula-py came from?

Question

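I am currently using tabula.read_pdf() to extract tables from a PDF. However, the result carries no information about which page each table came from. One approach is to get the total number of pages and iterate over them, passing each page number to tabula.read_pdf() through the pages argument. However, this is extremely inefficient. The timings below illustrate the difference; I am using this sample PDF: http://www.annualreports.com/HostedData/AnnualReports/PDF/NASDAQ_AMZN_2019.pdf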

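%%time
# Assumes `from tabula import read_pdf` and `pdf_path` were set in an earlier cell.
# Read pages 1 through 87 one at a time, so the page of each table is known.
for i in range(1,88):
    tables = read_pdf(pdf_path, pages=i, stream=True)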
# CPU times: user 803 ms, sys: 686 ms, total: 1.49 s
# Wall time: 3min 4s

%%time
tables = read_pdf(pdf_path, pages='all', stream=True)
# CPU times: user 402 ms, sys: 171 ms, total: 573 ms
# Wall time: 21.2 s

score 0 · Accepted Answer

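You can use camelot instead of tabula.

One cool feature of Camelot is that you also get a "parsing report" for each table, which gives an accuracy metric, the page the table was found on, and the percentage of whitespace in the table.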

import camelot

file = "your_file_path"
tables = camelot.read_pdf(file, pages="1-end")
# get the table at index 3 (the fourth table) as a DataFrame
tables[3].df
# get the parsing report for that table; it includes the page number
tables[3].parsing_report
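
To answer the original question for every table at once, the parsing reports can be collected into a page-to-tables mapping. The following is only a minimal sketch: tables_by_page and main are hypothetical helper names that are not part of camelot, and it assumes each table exposes parsing_report as a dict with a "page" entry, as described above.

```python
from collections import defaultdict


def tables_by_page(tables):
    """Return {page number: [table indices]} from each table's parsing report.

    Hypothetical helper: works on any sequence of objects that expose a
    dict-like `parsing_report` containing a "page" entry.
    """
    pages = defaultdict(list)
    for index, table in enumerate(tables):
        pages[table.parsing_report["page"]].append(index)
    return dict(pages)


def main(pdf_path):
    # Imported here so tables_by_page can be used without camelot installed.
    import camelot

    tables = camelot.read_pdf(pdf_path, pages="1-end")
    for page, indices in sorted(tables_by_page(tables).items()):
        print(f"page {page}: tables {indices}")
```

Calling main("your_file_path") would print one line per page that contains at least one detected table; the PDF is read in a single pass, so there is no need to loop over pages as in the question.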

Reference: http://theautomatic.net/2019/05/24/3-ways-to-scrape-tables-from-pdfs-with-python/
