python - Getting data from PDF files with the same layout as copy+paste

Question

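I have a process I'd like to automate that involves getting a series of tables out of PDF files. Currently I can do this by opening the file in any viewer (Adobe, Sumatra, okular, etc.) and simply pressing Ctrl+A, Ctrl+C, then Ctrl+V into Notepad. That keeps each line aligned in a reasonable enough format that I can run a regex over it and then copy and paste the result into Excel for whatever is needed later.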
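For illustration only, the regex step of that manual workflow might look roughly like the sketch below. It assumes the pasted text separates columns with runs of two or more spaces; the actual regex is not given here, and `split_pasted_line` and the sample text are made-up names and data.

```python
import re


def split_pasted_line(line):
    """Split one pasted text line into cells on runs of 2+ whitespace chars."""
    return re.split(r"\s{2,}", line.strip())


# Hypothetical text as it might look after pasting into Notepad
pasted = "Item A      12.50    3\nItem B       7.00   10"

table = [split_pasted_line(line) for line in pasted.splitlines()]

# Tab-separated text pastes into Excel as separate cells
excel_ready = "\n".join("\t".join(row) for row in table)
```

Single spaces inside a cell (as in "Item A") are left alone; only wider gaps are treated as column boundaries.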

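When trying to do this with python, I tried various modules, PDFMiner being the main one; it works, for instance with this example, but it returns the data in a single column. Other options include getting the output as an html table, but in that case it inserts extra splits in the middle of the table, which makes parsing more complicated, and it occasionally even swaps columns between the first and second pages.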

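I now have a temporary solution, but I worry that I'm reinventing the wheel: I may simply be missing a core option in the parser, or there may be something fundamental about how PDF renderers work that I need to take into account to solve this.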

Any ideas on how to approach this?

score 1 · Accepted Answer

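I ended up implementing a solution based on this one, which was itself modified from tgray's code. It works correctly in every case I have tested so far, but I have not yet worked out how to manipulate pdfminer's parameters directly to get the desired behavior.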

score 1 · Accepted Answer

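Posting this just to get a piece of code out there that works with py35 for csv-like parsing. The column splitting is about as simple as it gets, but it worked for me.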

Kudos to tgray, whose code in this answer served as the starting point.

I also put in openpyxl, since I prefer to get the results directly in Excel.

# works with py35 & pip-installed pdfminer.six in 2017
def pdf_to_csv(filename):
    from io import StringIO
    from pdfminer.converter import TextConverter
    from pdfminer.layout import LAParams, LTChar
    from pdfminer.pdfdocument import PDFDocument
    from pdfminer.pdfpage import PDFPage
    from pdfminer.pdfparser import PDFParser
    from pdfminer.pdfinterp import PDFResourceManager, PDFPageInterpreter

    class CsvConverter(TextConverter):
        def __init__(self, *args, **kwargs):
            TextConverter.__init__(self, *args, **kwargs)

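        def end_page(self, page):
            from collections import defaultdict
            # map each text line (keyed by its negated y coordinate, so that
            # sorting runs top to bottom) to a dict of {x position: character}
            lines = defaultdict(dict)
            for child in self.cur_item._objs:
                if isinstance(child, LTChar):
                    # bbox is (x0, y0, x1, y1); use the top-right corner
                    (_, _, x, y) = child.bbox
                    line = lines[int(-y)]
                    line[x] = child.get_text()
                    # the line is now an unsorted dict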

            for y in sorted(lines.keys()):
                line = lines[y]
                # combine close letters to form columns
                xpos = tuple(sorted(line.keys()))
                new_line = []
                temp_text = ''
                for i in range(len(xpos)-1):
                    temp_text += line[xpos[i]]
                    if xpos[i+1] - xpos[i] > 8:
                        # the 8 is representing font-width
                        # needs adjustment for your specific pdf
                        new_line.append(temp_text)
                        temp_text = ''
                # adding the last column which also manually needs the last letter
                new_line.append(temp_text+line[xpos[-1]])

                self.outfp.write(";".join(new_line))
                self.outfp.write("\n")

    # ... the following part of the code is a remix of the 
    # convert() function in the pdfminer/tools/pdf2text module
    rsrc = PDFResourceManager()
    outfp = StringIO()
    device = CsvConverter(rsrc, outfp, codec="utf-8", laparams=LAParams())

    fp = open(filename, 'rb')
    parser = PDFParser(fp)
    doc = PDFDocument(parser)
    password = ""
    maxpages = 0
    caching = True
    pagenos=set()

    interpreter = PDFPageInterpreter(rsrc, device)

    for i, page in enumerate(PDFPage.get_pages(fp,
                                pagenos, maxpages=maxpages,
                                password=password,caching=caching,
                                check_extractable=True)):
        outfp.write("START PAGE %d\n" % i)
        if page is not None:
            interpreter.process_page(page)
        outfp.write("END PAGE %d\n" % i)

    device.close()
    fp.close()

    return outfp.getvalue()

fn = 'your_file.pdf'
result = pdf_to_csv(fn)

# splitlines() avoids the empty trailing row that split('\n') would add
lines = result.splitlines()
import openpyxl as pxl
wb = pxl.Workbook()
ws = wb.active
for line in lines:
    ws.append(line.split(';'))
    # appending a list gives a complete row in xlsx
wb.save('your_file.xlsx')
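
To tune the gap threshold without running a whole PDF through pdfminer each time, the column-splitting step from `end_page` can be tried on its own. The following is a minimal sketch of the same idea (group characters by line, sort by x, start a new column when the gap exceeds a threshold), not part of the code above. The function name `group_chars_into_rows`, its `gap` parameter, and the synthetic sample data are made up for illustration.

```python
from collections import defaultdict


def group_chars_into_rows(chars, gap=8.0):
    """Group (x, y, text) tuples into rows, each a list of column strings.

    chars: iterable of (x, y, text) with x and y in PDF points
    gap:   horizontal distance above which a new column starts
    """
    lines = defaultdict(dict)
    for x, y, text in chars:
        # PDF y grows upward, so negate it to sort top to bottom
        lines[int(-y)][x] = text

    rows = []
    for key in sorted(lines):
        line = lines[key]
        xs = sorted(line)
        columns = [line[xs[0]]]
        for prev, cur in zip(xs, xs[1:]):
            if cur - prev > gap:
                columns.append(line[cur])   # wide gap: start a new column
            else:
                columns[-1] += line[cur]    # narrow gap: same column
        rows.append(columns)
    return rows


# Synthetic characters: two text lines with two columns each
sample = [
    (10, 700, "A"), (15, 700, "B"), (60, 700, "1"), (65, 700, "2"),
    (10, 680, "C"), (15, 680, "D"), (60, 680, "3"),
]
rows = group_chars_into_rows(sample)
```

With the default gap of 8, characters 5 points apart end up in the same column and the 45-point jump starts a new one. If columns merge or split in the wrong places for your PDF, adjusting `gap` is the first thing to try.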
