python - Reading tables from a PDF file by row instead of by column

Question

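I am trying to extract all of the text from PDF files. I am working with online PDFs, and they include tables. The code below works, but when it reaches a table in the PDF, the text in the table is printed column by column instead of row by row, which messes up my data. Is there a way to have the tables read by row without having to go through the tables separately? I still need all of the text in the PDF printed together. I am using Python.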

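import io
import urllib.request

from pdfminer.converter import TextConverter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage


def getTextFromPDF(url):
    # download the PDF and wrap its bytes in an in-memory file
    # (named pdf_bytes so the built-in open() is not shadowed)
    pdf_bytes = urllib.request.urlopen(url).read()
    memoryFile = io.BytesIO(pdf_bytes)

    # the converter writes the extracted text into fake_file_handle
    resource_manager = PDFResourceManager()
    fake_file_handle = io.StringIO()
    converter = TextConverter(resource_manager, fake_file_handle, laparams=LAParams())
    page_interpreter = PDFPageInterpreter(resource_manager, converter)

    with memoryFile as fh:
        # feed every page through the interpreter
        for page in PDFPage.get_pages(fh,
                                      caching=True,
                                      check_extractable=True):
            page_interpreter.process_page(page)

        text = fake_file_handle.getvalue()

    # close open handles
    converter.close()
    fake_file_handle.close()
    return text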

score -1 · Accepted Answer

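This answer is for anyone who runs into PDFs that contain images and needs to use OCR. I could not find a workable off-the-shelf solution; nothing gave me the accuracy I needed.

Here are the steps I found to work.

Use pdfimages from https://poppler.freedesktop.org/ to turn the pages of the PDF into images.

Use Tesseract to detect rotation and ImageMagick mogrify to fix it.

Use OpenCV to find and extract tables.

Use OpenCV to find and extract each cell from the table.

Use OpenCV to crop and clean up each cell so that there is no noise that will confuse the OCR software.

Use Tesseract to OCR each cell.

Combine the extracted text of each cell into the format you need.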
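
The last step is where the row-versus-column ordering from the question gets resolved, so here is a minimal sketch of it. It is not code from the package mentioned in this answer. It assumes each cell has already been reduced to a hypothetical `(x, y, text)` tuple (top-left corner of the cell's bounding box plus its OCR text), and `cells_to_rows`, `rows_to_csv`, and `row_tolerance` are illustrative names.

```python
import csv
import io


def cells_to_rows(cells, row_tolerance=10):
    """Group (x, y, text) cells into rows, top to bottom, left to right.

    A cell joins the current row when its y value is within
    `row_tolerance` pixels of the first cell placed in that row.
    """
    rows = []
    for x, y, text in sorted(cells, key=lambda c: (c[1], c[0])):
        if rows and abs(y - rows[-1]["y"]) <= row_tolerance:
            rows[-1]["cells"].append((x, text))
        else:
            rows.append({"y": y, "cells": [(x, text)]})
    # Within each row, order the cells by their x position.
    return [
        [text for _, text in sorted(row["cells"], key=lambda c: c[0])]
        for row in rows
    ]


def rows_to_csv(rows):
    """Serialize the rows as CSV text."""
    buffer = io.StringIO()
    csv.writer(buffer, lineterminator="\n").writerows(rows)
    return buffer.getvalue()


# Cells listed column by column, the way the text came out in the question.
cells = [
    (10, 12, "Name"), (10, 52, "Alice"), (10, 91, "Bob"),
    (200, 10, "Age"), (200, 50, "31"), (200, 90, "27"),
]
table_rows = cells_to_rows(cells)
print(rows_to_csv(table_rows))
```

The tolerance absorbs the few pixels of vertical jitter that detected cell boxes usually have; a value that suits one scan resolution may need tuning for another.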

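I wrote a Python package with modules that can help with those steps.

Repo: https://github.com/eihli/image-table-ocr

Docs and source: https://eihli.github.io/image-table-ocr/pdf_table_extraction_and_ocr.html

Some of the steps don't require code; they take advantage of external tools such as pdfimages and tesseract. I'll provide some brief examples for a couple of the steps that do require code.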
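
If you would rather drive the external tools from Python than from a shell, the following is a rough sketch of the first two steps. It is an assumption-laden illustration, not the package's code: the helper names are hypothetical, the tools must be on your PATH, Tesseract needs its orientation data (`osd.traineddata`) installed, and the `Rotate:` value in Tesseract's report is assumed to be the clockwise correction that `mogrify -rotate` expects.

```python
import re
import subprocess


def pdfimages_command(pdf_path, output_prefix):
    """Build the command that dumps a PDF's embedded images as PNG files."""
    return ["pdfimages", "-png", pdf_path, output_prefix]


def parse_rotation(osd_output):
    """Read the 'Rotate: N' line from Tesseract's orientation report.

    Returns 0 when the report has no such line.
    """
    match = re.search(r"^Rotate:\s*(\d+)", osd_output, flags=re.MULTILINE)
    return int(match.group(1)) if match else 0


def fix_rotation(image_path):
    """Detect rotation with Tesseract (--psm 0) and fix it in place with mogrify."""
    osd = subprocess.run(
        ["tesseract", image_path, "stdout", "--psm", "0"],
        capture_output=True, text=True, check=True,
    ).stdout
    degrees = parse_rotation(osd)
    if degrees:
        # Assumption: the reported value is the clockwise correction to apply.
        subprocess.run(["mogrify", "-rotate", str(degrees), image_path], check=True)
    return degrees


# Example usage (requires the tools to be installed, so it is left commented out):
# subprocess.run(pdfimages_command("report.pdf", "page"), check=True)
# fix_rotation("page-000.png")
```

Keeping the command building and the report parsing in small pure functions makes them easy to check without the tools installed.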

Finding tables: This link was a good reference while figuring out how to find the table. https://answers.opencv.org/question/63847/how-to-extract-tables-from-an-image/
