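There are a few ways to do this. Given the code you provided, the most compatible approach is to store each image temporarily on disk and delete it once pytesseract has read the text. I create a wand image to extract each page of the PDF separately, then convert it to a PIL image for pytesseract. Below is the code I used. It writes the detected text to the list "text", where each element corresponds to one image from the original PDF. I also updated some imports for Python 3 compatibility (cStringIO -> io and urllib2 -> urllib.request).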
import PyPDF2
import os
import pytesseract
from wand.image import Image
from PIL import Image as PILImage
import urllib.request
import io

# Placeholder path for the temporary image (left undefined in the original snippet; adjust as needed)
tempFile_Location = 'temp_page.png'

with urllib.request.urlopen('file:///home/user/Documents/TestDocs/test.pdf') as response:
    pdf_read = response.read()

# PdfFileReader/getNumPages are the pre-3.0 PyPDF2 names (newer releases use PdfReader and len(reader.pages))
pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
text = []
for p in range(pdf_im.getNumPages()):
    # The '[p]' suffix selects a single page of the PDF
    with Image(filename='file:///home/user/Documents/TestDocs/test.pdf' + '[' + str(p) + ']') as img:
        # A second "with" is needed to convert wand's SingleImage object to an Image
        with Image(image=img) as converted:
            converted.save(filename=tempFile_Location)
            text.append(pytesseract.image_to_string(PILImage.open(tempFile_Location)))
            os.remove(tempFile_Location)
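If a fixed temporary path is a concern (for example, two runs in the same directory would overwrite each other's file), the standard library's tempfile module can hand out a unique path and guarantee cleanup. The following is only a minimal sketch: the helper name temp_image_path and the ".png" suffix are assumptions for illustration, not part of the original code.

```python
import os
import tempfile
from contextlib import contextmanager


@contextmanager
def temp_image_path(suffix=".png"):
    """Yield a unique temporary file path and delete the file afterwards."""
    # mkstemp creates the file and returns an open OS-level handle plus its path
    fd, path = tempfile.mkstemp(suffix=suffix)
    os.close(fd)  # close our handle so another library can write to the path
    try:
        yield path
    finally:
        # Remove the file even if the body of the "with" block raised
        if os.path.exists(path):
            os.remove(path)


# Usage: write something to the path, read it back, and let the helper clean up
with temp_image_path() as path:
    with open(path, "wb") as handle:
        handle.write(b"placeholder bytes")
    print(os.path.getsize(path))
```

In the loop above, the calls to converted.save(filename=path) and PILImage.open(path) would go inside the "with temp_image_path() as path:" block, and the explicit os.remove call would no longer be needed.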
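Alternatively, if you want to avoid creating and deleting a temporary file for each image, you can extract the image as a blob, convert it to a numpy array with numpy and OpenCV, and then turn that into a PIL image so that pytesseract can run OCR on it (reference).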
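import PyPDF2
import pytesseract
from wand.image import Image
from PIL import Image as PILImage
import urllib.request
import io
import numpy as np
import cv2

with urllib.request.urlopen('file:///home/user/Documents/TestDocs/test.pdf') as response:
    pdf_read = response.read()

# PdfFileReader/getNumPages are the pre-3.0 PyPDF2 names (newer releases use PdfReader and len(reader.pages))
pdf_im = PyPDF2.PdfFileReader(io.BytesIO(pdf_read))
text = []
for p in range(pdf_im.getNumPages()):
    # The '[p]' suffix selects a single page of the PDF
    with Image(filename='file:///home/user/Documents/TestDocs/test.pdf' + '[' + str(p) + ']') as img:
        # Request PNG explicitly so the blob is an encoded image OpenCV can decode
        # (without a format, wand may keep the source PDF format)
        img_buffer = np.asarray(bytearray(img.make_blob('png')), dtype=np.uint8)
        # Decode the in-memory bytes to a grayscale array; imdecode returns None on failure
        retval = cv2.imdecode(img_buffer, cv2.IMREAD_GRAYSCALE)
        text.append(pytesseract.image_to_string(PILImage.fromarray(retval)))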
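Either version leaves you with the list "text", one string per extracted image. As a small illustration of what you might do with it, the sketch below joins the entries into a single string with a header per page. The function name join_pages and the header format are hypothetical choices, not something the original code defines.

```python
def join_pages(text, separator="\n"):
    """Combine per-page OCR strings into one string with a header per page."""
    parts = []
    for number, page_text in enumerate(text, start=1):
        parts.append(f"--- page {number} ---")
        # OCR output often carries stray leading/trailing whitespace
        parts.append(page_text.strip())
    return separator.join(parts)


# Sample strings standing in for real OCR output
sample = ["First page text\n", "  Second page text"]
print(join_pages(sample))
```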