pdf-scraping - Table header is not extracted when extracting table data from a PDF with camelot

Question

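I am using camelot to extract table data, but the table's header row is not being extracted from the PDF.

The target PDF is linked below; the tables I need to extract are on pages 3 and 4.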

https://drive.google.com/file/d/1xniTIwpnNIdA_k4xvEARlVH97Lk-K2Yr/view?usp=sharing

One of the tables is shown below.

I have read the camelot documentation, and I think the problem is related to "detect short lines":

https://camelot-py.readthedocs.io/en/master/user/advanced.html#detect-short-lines

However, I could not solve the problem by adjusting the line_size_scaling parameter.

Please assist.

score 3 · Accepted Answer

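I plotted the detected table boundaries on page 3 using $ camelot -p 3 lattice -plot contour 007.pdf. It looks like Camelot does not include the header row inside the detected table boundary [bug 1] (see the figure below). I then tried the table_areas keyword argument with flavor='lattice', but it did not include the rows inside the specified table boundary either [bug 2]. I have added these to the issue tracker as #200 and #201.

You can still get the table by using the table_areas keyword argument with flavor='stream'.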

Using the command line: $ camelot -p 3 --output 007.csv --format csv stream -T 60,770,520,400 007.pdf

Using the API: tables = camelot.read_pdf('007.pdf', pages='3', flavor='stream', table_areas=['60,770,520,400'])
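For anyone who wants the API call and the CSV export from the command-line version in one place, here is a minimal sketch. The helper names parse_table_area and extract_with_stream are made up for illustration and are not part of Camelot; the sanity check assumes the table_areas convention from the Camelot docs, where x1,y1 is the left-top corner and x2,y2 is the right-bottom corner in PDF coordinate space (origin at the bottom-left of the page).

```python
def parse_table_area(area):
    """Split an 'x1,y1,x2,y2' string into floats and sanity-check it.

    Hypothetical helper: assumes (x1, y1) is left-top and (x2, y2) is
    right-bottom in PDF coordinates, so x1 < x2 and y1 > y2.
    """
    x1, y1, x2, y2 = (float(v) for v in area.split(","))
    if not (x1 < x2 and y1 > y2):
        raise ValueError(
            f"expected left-top then right-bottom corner, got {area!r}"
        )
    return x1, y1, x2, y2


def extract_with_stream(pdf_path, page, area, csv_path=None):
    """Read one page with flavor='stream', restricted to a single area.

    Hypothetical wrapper around camelot.read_pdf; camelot is imported
    lazily so the helper above can be used without it installed.
    """
    import camelot  # third-party: pip install camelot-py

    parse_table_area(area)  # fail early on a malformed area string
    tables = camelot.read_pdf(
        pdf_path,
        pages=str(page),
        flavor="stream",
        table_areas=[area],
    )
    # Optionally write the first detected table to CSV.
    if csv_path is not None and len(tables) > 0:
        tables[0].to_csv(csv_path)
    return tables


if __name__ == "__main__":
    # Values taken from the answer above; 007.pdf must exist locally.
    result = extract_with_stream("007.pdf", 3, "60,770,520,400", "007.csv")
    if len(result) > 0:
        print(result[0].df.head())
```

The same area string may need adjusting for page 4, since the table does not necessarily sit at the same coordinates there.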

You can find the table boundary coordinates by following the steps described here: https://camelot-py.readthedocs.io/en/master/user/advanced.html#visual-debugging

Hope this helps!
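As a rough sketch of that workflow in Python: plot the text on the page, read two opposite corners of the table off the axes, and turn them into a table_areas string. The functions show_page_text and area_from_corners are hypothetical helpers, not Camelot API; the only Camelot calls used are camelot.read_pdf and camelot.plot, and plotting assumes matplotlib is installed.

```python
def area_from_corners(p, q):
    """Build an 'x1,y1,x2,y2' string from any two opposite corners.

    Hypothetical helper: normalises the points so the result is
    left-top then right-bottom in PDF coordinates (origin bottom-left),
    which is the order table_areas is documented to expect.
    """
    (xa, ya), (xb, yb) = p, q
    x1, x2 = min(xa, xb), max(xa, xb)
    y1, y2 = max(ya, yb), min(ya, yb)  # top edge has the larger y
    return f"{x1},{y1},{x2},{y2}"


def show_page_text(pdf_path, page):
    """Plot the text Camelot sees on a page so coordinates can be read off."""
    import camelot  # third-party: pip install camelot-py

    tables = camelot.read_pdf(pdf_path, pages=str(page), flavor="stream")
    if len(tables) == 0:
        raise RuntimeError("no table detected, nothing to plot")
    # kind='text' draws the text boxes on axes in PDF coordinates.
    camelot.plot(tables[0], kind="text").show()


if __name__ == "__main__":
    show_page_text("007.pdf", 3)  # 007.pdf must exist locally
    # Corners read off the plot by eye (values from the answer above).
    print(area_from_corners((520, 400), (60, 770)))
```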
