python - How to extract data from multiple PDFs in the same directory using python-camelot?

Question

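I am trying to extract data from multiple tables in multiple PDFs and save it in CSV format. After some research I found that python-camelot is a good extraction tool. I tried it, and it works well on a single PDF. However, I have more than 50 PDFs in the same format, so I decided to iterate over all the files with a for loop. It does not work, and I get an error saying the file was not found in the directory. Can you help? Here is the code: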

import tkinter 
import camelot
import os

directory = 'C:\\Users\\Alr\\Desktop\\test\\'
files = [ filename for filename in os.listdir(directory)]
for i in range (len(files)):
    tables = camelot.read_pdf(files[i], pages='5,6,7')
    tables.export(files[i], f='csv', compress=True) # json, excel, html, sqlite
    tables.to_csv(files[i]+'.csv')

score 2 · Accepted Answer

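As suggested in the comments, the problem is that os.listdir returns only the file names, not the full paths, so camelot.read_pdf looks for each file relative to the current working directory instead of inside the target directory.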
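To see the cause in isolation, here is a minimal sketch that does not involve camelot at all. It creates a temporary directory containing one empty file (the name `report.pdf` is hypothetical, chosen only for the demonstration) and compares what os.listdir returns with the path produced by os.path.join.

```python
import os
import tempfile

# Create a throwaway directory containing one empty file named report.pdf
# (the file name is hypothetical, used only for this demonstration).
with tempfile.TemporaryDirectory() as directory:
    open(os.path.join(directory, "report.pdf"), "w").close()

    # os.listdir gives bare names, relative to `directory`
    names = os.listdir(directory)

    # Joining each name with the directory restores a usable path
    full_paths = [os.path.join(directory, name) for name in names]

    # The bare name is resolved against the current working directory,
    # the joined path is resolved against `directory`
    bare_name_found = os.path.exists(names[0])
    full_path_found = os.path.exists(full_paths[0])

print(names)
print(full_paths)
print(bare_name_found, full_path_found)
```

Unless the script happens to run from inside that directory, only the joined path points at the file, which is why the loop in the question fails with a "file not found" error.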

You can try this:

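import camelot
import glob
import os

# glob returns full paths, so camelot can locate each file
pattern = 'C:\\Users\\Alr\\Desktop\\test\\*.pdf'

for pdf_filepath in glob.glob(pattern):
    # change only the extension; str.replace would also hit '.pdf' elsewhere in the path
    csv_filepath = os.path.splitext(pdf_filepath)[0] + '.csv'
    tables = camelot.read_pdf(pdf_filepath, pages='5,6,7')

    # export() writes one CSV per extracted table, plus a zip archive when compress=True
    tables.export(csv_filepath, f='csv', compress=True) # json, excel, html, sqlite

    # the tables.to_csv(csv_filepath) call from the question is dropped:
    # to_csv belongs to a single table, e.g. tables[0].to_csv(csv_filepath)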
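As an alternative, the same loop can be sketched with pathlib, which keeps the directory and file name together in one object. This is only a sketch under a few assumptions: the helper names `csv_path_for` and `export_all` are hypothetical, the page selection is copied from the question, and camelot is imported inside the function so that the path helper can be tried without camelot installed.

```python
from pathlib import Path


def csv_path_for(pdf_path):
    """Return a path next to the PDF with only the suffix changed to .csv."""
    return Path(pdf_path).with_suffix(".csv")


def export_all(directory, pages="5,6,7"):
    """Run camelot on every *.pdf in `directory` and export its tables as CSV."""
    # Imported here so csv_path_for stays usable without camelot installed
    import camelot

    # Path.glob yields paths that already include the directory
    for pdf_path in sorted(Path(directory).glob("*.pdf")):
        tables = camelot.read_pdf(str(pdf_path), pages=pages)
        # export() writes one CSV per extracted table, plus a zip when compress=True
        tables.export(str(csv_path_for(pdf_path)), f="csv", compress=True)
```

With the directory from the question, the call would be export_all('C:\\Users\\Alr\\Desktop\\test'). Using with_suffix rather than str.replace means a directory whose name happens to contain ".pdf" is left untouched.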
