python-3.x - OCR a multi-page PDF in Python

Question

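I am using pytesseract to OCR images. I have statement PDFs that are 3-4 pages long. I need a way to convert them into multiple .jpg/.png images and run OCR on those images one by one. So far, I have been converting a single page to an image and then running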

text=str(pytesseract.image_to_string(Image.open("imagename.jpg"),lang='eng'))

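After that I use regular expressions to extract the information and build a dataframe. The regex logic is the same for every page. So if I could read the image files in a loop, the process could be automated for any PDF with the same format.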
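
A minimal sketch of the "same regex on every page" step, assuming the OCR text of each page is already available as a list of strings. The pattern and the field names (`date`, `amount`) are hypothetical stand-ins, since the question does not show the statement layout; swap in your own regex.

```python
import re

# Hypothetical pattern: a date at the start of the line and an amount at the end,
# e.g. "2021-03-04 Coffee shop 4.50". Replace with the regex for your statements.
LINE_PATTERN = re.compile(
    r"(?P<date>\d{4}-\d{2}-\d{2})\s+.*?(?P<amount>-?[\d,]+\.\d{2})\s*$"
)

def extract_rows(page_texts):
    """Apply the same regex to every page and collect the matches as dicts."""
    rows = []
    for page_no, text in enumerate(page_texts, start=1):
        for line in text.splitlines():
            match = LINE_PATTERN.search(line)
            if match:
                row = match.groupdict()
                row["page"] = page_no  # remember which page the row came from
                rows.append(row)
    return rows

if __name__ == "__main__":
    pages = ["2021-03-04 Coffee shop 4.50\nnoise", "2021-03-05 Refund -12.00"]
    for row in extract_rows(pages):
        print(row)
```

The returned list of dicts can be handed to `pandas.DataFrame(rows)` to build the dataframe.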

score 2 · Accepted Answer

PyMuPDF is another option for looping through the pages as image files. Here is how to do it:

import fitz  # PyMuPDF
from PIL import Image
import pytesseract

input_file = 'path/to/your/pdf/file'
pdf_file = input_file
fullText = ""

doc = fitz.open(pdf_file)  # open the PDF with the PyMuPDF (fitz) bindings
### ---- If you need to scale a scanned image --- ###
zoom = 1.2  # scale each page by 120%
mat = fitz.Matrix(zoom, zoom)
noOfPages = doc.page_count  # doc.pageCount in older PyMuPDF releases

for pageNo in range(noOfPages):
    page = doc.load_page(pageNo)  # load one page (doc.loadPage in older releases)
    pix = page.get_pixmap(matrix=mat)  # render the page, scaled by `mat`
    output = '/path/to/save/image/files' + str(pageNo) + '.png'  # .png, to match the format written below
    pix.save(output)  # write the rendered page as a PNG (pix.writePNG in older releases)

    text = pytesseract.image_to_string(Image.open(output))  # OCR the saved page image
    fullText += text

fullText = fullText.splitlines()  # or do something here to extract information using regex

This is quite handy, depending on how you want to handle the PDF files. For more details on PyMuPDF, these links may help: Tutorial on PyMuPDF and git for PyMuPDF
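
If the intermediate image files are not needed, the rendered pixmap can be handed to Pillow in memory. The sketch below is one possible variant, not part of the original answer: it assumes a recent PyMuPDF release (snake_case method names), and the function names and the 300 DPI default are my own choices, 300 DPI being a commonly suggested starting point for Tesseract rather than a requirement.

```python
def zoom_for_dpi(target_dpi, base_dpi=72.0):
    """PDF page units are 1/72 inch, so zoom 1.0 renders at 72 DPI."""
    return target_dpi / base_dpi

def ocr_pdf_in_memory(pdf_path, dpi=300, lang="eng"):
    """Render each page with PyMuPDF and OCR it without writing image files."""
    # Imported here so zoom_for_dpi stays usable without the OCR stack installed.
    import fitz  # PyMuPDF
    import pytesseract
    from PIL import Image

    zoom = zoom_for_dpi(dpi)
    mat = fitz.Matrix(zoom, zoom)
    page_texts = []
    with fitz.open(pdf_path) as doc:
        for page in doc:
            pix = page.get_pixmap(matrix=mat, alpha=False)  # RGB, no alpha channel
            img = Image.frombytes("RGB", (pix.width, pix.height), pix.samples)
            page_texts.append(pytesseract.image_to_string(img, lang=lang))
    return page_texts
```

Calling `ocr_pdf_in_memory('path/to/your/pdf/file')` would return one string per page, which keeps the page boundaries available for the regex step.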

Hope this helps.

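Edit: If you have a clean PDF (one with a real text layer rather than scanned images), a more direct way to do this with PyMuPDF is to read the extracted text directly. After page = doc.load_page(pageNo) (doc.loadPage in older PyMuPDF releases), simply do the following: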

blocks = page.get_text("blocks")  # page.getText in older PyMuPDF releases
blocks.sort(key=lambda block: block[3])  # sort by 'y1' values

for block in blocks:
    print(block[4])  # print the lines of this block

Disclaimer: the idea of using blocks shown above comes from the repo maintainer. More details can be found here: issue discussion on git
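
As a rough illustration of the block approach, the helper below turns a list of blocks into plain text. It assumes each block follows the tuple layout `(x0, y0, x1, y1, text, block_no, block_type)` with `block_type` 0 for text; the function name and the extra `x0` tie-breaker are my own additions, not from the maintainer's suggestion.

```python
def blocks_to_text(blocks):
    """Join text blocks in top-to-bottom, left-to-right order.

    Each block is assumed to look like an item of page.get_text("blocks"):
    (x0, y0, x1, y1, text, block_no, block_type), with block_type 0 for text.
    """
    # Keep text blocks only; shorter tuples without a type field are kept as-is.
    text_blocks = [b for b in blocks if len(b) < 7 or b[6] == 0]
    # Sort by y1 first, then by x0 so blocks on the same row read left to right.
    text_blocks.sort(key=lambda b: (b[3], b[0]))
    return "\n".join(b[4].strip() for b in text_blocks)

if __name__ == "__main__":
    sample = [
        (50, 100, 200, 120, "Second line\n", 1, 0),
        (50, 50, 200, 70, "First line\n", 0, 0),
        (50, 130, 200, 300, "<image>", 2, 1),  # image block, skipped
    ]
    print(blocks_to_text(sample))
```

With a real document this would be called as `blocks_to_text(page.get_text("blocks"))` for each page.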

score 0 · Accepted Answer

The following works for me:

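from wand.api import library
from wand.image import Image

# Read the PDF at 300 DPI; Wand loads all pages into one multi-frame image
with Image(filename=r"imagepath.pdf", resolution=300) as img:
    library.MagickResetIterator(img.wand)
    # Step the wand's iterator through every page
    for idx in range(library.MagickGetNumberImages(img.wand)):
        library.MagickSetIteratorIndex(img.wand, idx)

    # Write all pages into a single multi-page TIFF
    img.save(filename="output.tiff")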

The problem now is reading every page in the TIFF file, because if I extract the text with

text=str(pytesseract.image_to_string(Image.open("output.tiff"),lang='eng'))

it only OCRs the first page.
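
One possible way around this, sketched under the assumption that the TIFF is opened with Pillow (whose multi-frame images expose `n_frames` and `seek`), is to visit each frame and OCR it separately. The function names are hypothetical, and this is not part of the original answer.

```python
def ocr_all_frames(img, ocr):
    """Run `ocr` on every frame of a multi-frame image and return the texts.

    `img` is expected to behave like a PIL image: `n_frames` gives the page
    count, `seek(i)` moves to frame i, and `copy()` returns the current frame.
    `ocr` is any callable that takes one frame and returns its text.
    """
    texts = []
    for index in range(getattr(img, "n_frames", 1)):
        img.seek(index)  # move to page `index` of the TIFF
        texts.append(ocr(img.copy()))  # copy so only the current frame is passed on
    return texts

def ocr_tiff(path, lang="eng"):
    """OCR each page of a multi-page TIFF with pytesseract."""
    # Imported here so ocr_all_frames can be exercised without the OCR stack.
    import pytesseract
    from PIL import Image  # PIL's Image, not wand.image.Image

    with Image.open(path) as img:
        return ocr_all_frames(
            img, lambda frame: str(pytesseract.image_to_string(frame, lang=lang))
        )
```

`ocr_tiff("output.tiff")` would then return a list with one string per page. Note that `wand.image.Image` and `PIL.Image` share the name `Image`, so importing one of them under an alias avoids clashes when both are used in the same script.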
