This is a sample image from my PDF file, which has 75 pages.
2 Answers
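Camelot is an excellent choice for extracting borderless tables. Pass flavor='stream' so that it detects columns from whitespace rather than from ruling lines.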
import camelot

tables = camelot.read_pdf('sample.pdf', flavor='stream', edge_tol=500, pages='1-end')
# Tables from all pages are stored in the `tables` object
df = tables[0].df  # first table as a pandas DataFrame
df.to_csv('sample.csv', index=False)
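The snippet above only looks at the first table. With 75 pages there will usually be many, so here is a minimal sketch that writes each detected table to its own CSV. The helper names `table_csv_path` and `export_all_tables` and the `-table-N` naming scheme are my own assumptions, not part of Camelot.

```python
from pathlib import Path


def table_csv_path(pdf_path, index):
    """Build an output name such as sample-table-3.csv for table number `index`."""
    pdf = Path(pdf_path)
    return pdf.with_name(f"{pdf.stem}-table-{index}.csv")


def export_all_tables(pdf_path, pages="1-end"):
    """Write every table Camelot finds to its own CSV and return the paths."""
    # Imported lazily so the naming helper works without Camelot installed.
    import camelot

    tables = camelot.read_pdf(str(pdf_path), flavor="stream", pages=pages)
    written = []
    for i, table in enumerate(tables, start=1):
        out = table_csv_path(pdf_path, i)
        # Camelot DataFrames have integer column labels, so skip the header row.
        table.df.to_csv(out, index=False, header=False)
        written.append(out)
    return written


# Usage (needs Camelot and a real PDF):
# export_all_tables("sample.pdf")
```

Stream mode may split or merge tables differently from what you expect, so it is worth checking a few of the CSVs by hand before trusting the whole batch.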
answered 2020-06-08T08:20:43.083
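You can do this with Python and the tabula module (the tabula-py package). Since the table is borderless, you can first locate its area dynamically with my get_area function (adjust the page number etc. to suit your file):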
from tabula import convert_into, convert_into_by_batch, read_pdf
from tabulate import tabulate


def get_area(file):
    """Set and return the area from which to extract data from within a PDF page
    by reading the file as JSON, extracting the locations
    and expanding these.
    """
    tables = read_pdf(file, output_format="json", pages=2, silent=True)
    top = tables[0]["top"]
    left = tables[0]["left"]
    bottom = tables[0]["height"] + top
    right = tables[0]["width"] + left
    # print(f"{top=}\n{left=}\n{bottom=}\n{right=}")
    return [top - 20, left - 20, bottom + 10, right + 10]
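One thing to watch in get_area: subtracting 20 from `top` or `left` can produce a negative coordinate when the detected table sits close to the page edge. Below is a small sketch of the same arithmetic with clamping, kept separate from the PDF reading so it can be tried on its own. The name `padded_area` and the sample values are hypothetical.

```python
def padded_area(table, pad_before=20, pad_after=10):
    """Return [top, left, bottom, right], expanded by the given padding.

    `table` is one entry of the list that tabula returns for
    output_format="json" (it carries "top", "left", "width", "height").
    """
    top = table["top"]
    left = table["left"]
    bottom = top + table["height"]
    right = left + table["width"]
    # Clamp at zero so a table near the page edge never yields a negative value.
    return [
        max(top - pad_before, 0.0),
        max(left - pad_before, 0.0),
        bottom + pad_after,
        right + pad_after,
    ]


# Hypothetical values, only to show the arithmetic:
example = {"top": 12.0, "left": 50.0, "width": 400.0, "height": 300.0}
print(padded_area(example))  # [0.0, 30.0, 322.0, 460.0]
```

Inside get_area this would amount to `return padded_area(tables[0])`.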
Before converting, check that your first table is parsed correctly:
def inspect_1st_table(file: str):
    df = read_pdf(
        file,
        # output_format="dataframe",
        multiple_tables=True,
        pages="all",
        area=get_area(file),
        silent=True,  # Suppress all stderr output
    )[0]
    print(tabulate(df.head()))
Then use that area to extract the tables from the PDF into a CSV:
def convert_pdf_to_csv(file: str):
    """Output all the tables in the PDF to a CSV"""
    convert_into(
        file,
        file[:-3] + "csv",
        output_format="csv",
        pages="all",
        area=get_area(file),
        silent=True,
    )
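The `file[:-3] + "csv"` slice works as long as the name ends in a three-character extension such as `.pdf`. If you would rather not depend on that, a possible alternative is to let pathlib swap the suffix. The helper name `csv_path_for` is just an illustration.

```python
from pathlib import Path


def csv_path_for(pdf_file):
    """Return the PDF's path with its final extension replaced by .csv."""
    return str(Path(pdf_file).with_suffix(".csv"))


print(csv_path_for("report.pdf"))   # report.csv
print(csv_path_for("report.PDF"))   # report.csv
print(csv_path_for("scan.v2.pdf"))  # scan.v2.csv
```

You could then pass `csv_path_for(file)` as the second argument to convert_into.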
If you need to extract more than one table, inspect them all first:
def show_tables(file: str):
    """Read pdf into list of DataFrames"""
    tables = read_pdf(
        file, pages="all", multiple_tables=True, area=get_area(file), silent=True
    )
    for df in tables:
        print(tabulate(df))
And to convert the tables from every PDF in a directory to CSV in one batch:
def convert_batch(directory: str):
    """convert all PDFs in a directory"""
    convert_into_by_batch(directory, output_format="csv", pages="all", silent=True)
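Note that the batch call applies one set of options to every file in the directory. If each PDF needs its own area, one option is to loop over the files yourself and call a per-file converter. The sketch below assumes a hypothetical helper `convert_each` that takes the converter as a callable, so it runs without tabula installed.

```python
from pathlib import Path


def convert_each(directory, convert):
    """Call convert(pdf_path) for every *.pdf in `directory`.

    Returns a dict mapping file name to the exception raised for any
    file that could not be converted, so one bad PDF does not stop the run.
    """
    failures = {}
    # glob is case-sensitive on most Linux filesystems, so "*.PDF" is not matched.
    for pdf in sorted(Path(directory).glob("*.pdf")):
        try:
            convert(str(pdf))
        except Exception as exc:  # e.g. no table detected on the probed page
            failures[pdf.name] = exc
    return failures


# Usage with the per-file function from this answer (needs tabula and Java):
# failures = convert_each("pdfs/", convert_pdf_to_csv)
# print(failures)
```

This is slower than the batch call because the area is detected once per file, but each file gets an area that matches its own layout.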
answered 2020-06-08T08:02:52.930