python - tabula and camelot do not detect tables

Question

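I am trying to extract tables from PDFs that I believe are badly formatted. The tables in these PDFs have a tabular layout but are not properly enclosed by vertical borders. I will attach a sample PDF along with the output from both libraries. When I try table detection with tabula, it returns an empty dataframe for every page in the PDF.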

Enter 0 for single page, 1 for all, 2 for specific pages: 2
Enter page number: 25
No table found on this page by tabula.

When I use camelot with flavor='lattice', I likewise get no result.

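Enter 0 for single page, 1 for all pages, 2 for pages with tables detected by tabula, 3 for specific pages: 3
Enter 0 for lattice or 1 for stream: 0
Enter page number: 25
No table found on this page by camelot.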

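When I use flavor='stream', I get a dataframe in which the page is read line by line as tab-separated data, but the dataframe also includes the ordinary text on the page.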

Enter 0 for single page, 1 for all pages, 2 for pages with tables detected by tabula, 3 for specific pages: 3
Enter 0 for lattice or 1 for stream: 1
Enter page number: 25

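I just need an efficient way to detect tables and extract their data when the vertical lines enclosing the table are absent. If the table is properly formatted, enclosed by both vertical and horizontal lines, both tabula and camelot work fine.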

score 0 · Accepted Answer

This method may help you: https://camelot-py.readthedocs.io/en/master/user/advanced.html#specify-column-separators

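You can tell camelot where the vertical separators are by passing their x-coordinates. First, use camelot's plot() function to view the table in the PDF and note the x-coordinates where you want the vertical separators, then pass them like this: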

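import camelot

# to get the x-coordinates (use stream, since lattice finds no tables in this PDF)
tables = camelot.read_pdf('your_pdf.pdf', flavor='stream')
camelot.plot(tables[0], kind='text').show()

# to pass the x-coordinates: one comma-separated string per table, e.g. ['72,95,209']
camelot.read_pdf('your_pdf.pdf', flavor='stream', columns=['x1,x2'])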
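
The format of the columns argument is easy to get wrong: each list entry is a single comma-separated string of x-coordinates, not a list of numbers. A minimal sketch of building that value from coordinates read off the plot follows. The helper names columns_arg and read_with_separators, and the sample coordinates, are hypothetical and not part of camelot or the original answer.

```python
def columns_arg(x_coords):
    """Build a value for camelot's `columns` argument from x-coordinates.

    Returns a one-element list holding a comma-separated string,
    sorted left to right, e.g. ['72,95,209'].
    """
    return [",".join(str(x) for x in sorted(x_coords))]


def read_with_separators(pdf_path, page, x_coords):
    # Imported here so columns_arg can be used without camelot installed.
    import camelot

    # `pages` is passed as a string, e.g. '25'.
    return camelot.read_pdf(pdf_path, pages=str(page), flavor="stream",
                            columns=columns_arg(x_coords))
```

For example, columns_arg([209, 72, 95]) gives ['72,95,209'], which can be passed directly as columns=... together with flavor='stream'.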

score -1 · Accepted Answer


I have been struggling with extracting tables from PDFs recently.

Tabula and camelot did not work for me either, but pdfplumber gave me the results I needed.

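import pandas as pd
import pdfplumber

# filepath and outfile2 are placeholders for your input PDF and output CSV paths
with pdfplumber.open(filepath) as pdf:
    # pages is zero-indexed, so pages[1] is the second page of the PDF
    table = pdf.pages[1].extract_table(table_settings={
        "vertical_strategy": "text", "horizontal_strategy": "text"})
# assumes the first extracted row holds the column headers
df = pd.DataFrame(table[1:], columns=table[0])
df.to_csv(outfile2, mode='a', index=False)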
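
With the "text" strategies, the extracted table may contain rows or columns that are entirely empty, and individual cells may be None. A minimal sketch of cleaning that output before building a DataFrame, assuming the table is a list of row lists as returned by extract_table (the helper name drop_empty is hypothetical):

```python
def drop_empty(table):
    """Remove rows and columns whose cells are all None or blank."""
    def blank(cell):
        return cell is None or str(cell).strip() == ""

    # Keep only rows that have at least one non-blank cell.
    rows = [row for row in table if not all(blank(c) for c in row)]
    if not rows:
        return []
    width = max(len(row) for row in rows)
    # Pad short rows so every row has the same number of cells.
    rows = [list(row) + [None] * (width - len(row)) for row in rows]
    # Keep only columns that have at least one non-blank cell.
    keep = [i for i in range(width) if not all(blank(r[i]) for r in rows)]
    return [[r[i] for i in keep] for r in rows]
```

The result can be used in place of table when building the DataFrame.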
