python - pytesseract and image.tif file

Question

I need to use pytesseract to transcribe an image.tif that contains several pages into text. I have the following code:

> from PIL import Image
> import pytesseract
> pytesseract.pytesseract.tesseract_cmd = 'C:/Program Files (x86)/Tesseract-OCR/tesseract'
> print(pytesseract.image_to_string(Image.open('CAMARA.tif'), lang="spa"))

The problem is that only the first page is extracted. How can I extract all of them?

score 8 · Accepted Answer

I was able to solve the same problem by calling the convert() method as follows:

from PIL import Image
import pytesseract

image = Image.open(imagePath).convert("RGBA")  # imagePath: path to the .tif file
text = pytesseract.image_to_string(image)
print(text)
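
If convert() alone still returns only the first page, the frames of the TIFF can be visited one by one. The following is a minimal sketch, assuming Pillow's ImageSequence.Iterator and an installed Tesseract with the "spa" language data. The function name ocr_all_pages and the ocr parameter are hypothetical and exist only for illustration.

```python
from PIL import Image, ImageSequence


def ocr_all_pages(path, lang="spa", ocr=None):
    """Return a list with the recognized text of each page (frame) in a TIFF.

    `ocr` is a hypothetical hook for swapping in another OCR function;
    by default pytesseract.image_to_string is used.
    """
    if ocr is None:
        import pytesseract  # imported lazily so the hook works without it
        ocr = pytesseract.image_to_string

    texts = []
    with Image.open(path) as img:
        # ImageSequence.Iterator positions the image at each frame in turn.
        for frame in ImageSequence.Iterator(img):
            # convert() returns an independent copy of the current frame.
            page = frame.convert("RGB")
            texts.append(ocr(page, lang=lang))
    return texts
```

Calling ocr_all_pages('CAMARA.tif') would then return one string per page, which can be joined with "\n" before printing.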

score 0 · Accepted Answer

I just stumbled upon the same problem... What you can do is call tesseract directly:

# test.py
import subprocess

in_filename = 'file_0.tiff'
out_filename = 'out'
lang = 'spa'
subprocess.call(['tesseract', in_filename, '-l', lang, out_filename ])

This will process all pages:

$ python test.py
Tesseract Open Source OCR Engine v4.0.0-beta.1 with Leptonica
Page 1
Page 2
Page 3
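
Tesseract writes plain-text output to the output base name plus ".txt", so with the values in test.py the text should end up in out.txt. Below is a minimal sketch that runs the same command and reads the file back. The helper names build_command and ocr_with_cli are hypothetical, and the sketch assumes the tesseract executable is on PATH.

```python
import subprocess
from pathlib import Path


def build_command(in_filename, out_base, lang):
    """Build the argument list for the tesseract CLI, as in test.py."""
    return ["tesseract", in_filename, "-l", lang, out_base]


def ocr_with_cli(in_filename, out_base="out", lang="spa"):
    """Run tesseract on a (possibly multi-page) TIFF and return the text."""
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_command(in_filename, out_base, lang), check=True)
    # Tesseract appends ".txt" to the output base name for text output.
    return Path(out_base + ".txt").read_text(encoding="utf-8")
```

For example, print(ocr_with_cli('file_0.tiff')) would print the text of every page in one go.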

score 0 · Accepted Answer

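I guess you only mentioned a single image, "camara.tif". First you have to convert all the PDF pages to images; you can check this link.

Next, loop over the images one by one with pytesseract to extract the text from each image.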
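
The loop described above might look like the following sketch. The function name ocr_image_files, the "*.png" pattern, and the ocr hook are assumptions made for illustration; how the pages are converted to images beforehand is not covered here.

```python
from pathlib import Path


def ocr_image_files(folder, pattern="*.png", lang="spa", ocr=None):
    """OCR every file in `folder` matching `pattern`, in sorted name order.

    Returns a dict mapping each file name to its recognized text.
    `ocr` is a hypothetical hook taking (path, lang); by default it opens
    the image with Pillow and passes it to pytesseract.image_to_string.
    """
    if ocr is None:
        import pytesseract
        from PIL import Image

        def ocr(path, lang):
            with Image.open(path) as img:
                return pytesseract.image_to_string(img, lang=lang)

    results = {}
    for path in sorted(Path(folder).glob(pattern)):
        results[path.name] = ocr(path, lang)
    return results
```

Because the files are sorted by name, zero-padded names such as page_01.png and page_02.png keep the pages in reading order.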
