python - Image-to-text recognition with Tesseract-OCR works better when the image is preprocessed manually in Gimp than with my Python code

Question

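I am trying to write Python code that reproduces the image preprocessing and recognition I currently do by hand with Tesseract-OCR.

Manual process:
To recognize the text in a single image manually, I preprocess the image in Gimp and create a TIF image. I then feed it to Tesseract-OCR, which recognizes it correctly.

To preprocess the image in Gimp, I do the following -

Change mode to RGB / Grayscale
Menu -- Image -- Mode -- RGB
Thresholding
Menu -- Tools -- Color Tools -- Threshold -- Auto
Change mode to Indexed
Menu -- Image -- Mode -- Indexed
Resize / Scale to Width > 300px
Menu -- Image -- Scale image -- Width=300
Save as Tif

Then I feed it to tesseract -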

$ tesseract captcha.tif output -psm 6
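
For scripting the manual step, the same command line can be driven from Python. This is only a sketch: the helper names `tesseract_command` and `run_tesseract` are hypothetical, and it assumes the `tesseract` binary is on the PATH. Tesseract 3.x spells the page segmentation flag `-psm` (as in the command above), while Tesseract 4 and later expect `--psm`.

```python
import subprocess


def tesseract_command(image_path, output_base, psm=6, legacy_flag=True):
    """Build the argv list for the tesseract CLI.

    legacy_flag=True emits "-psm" (Tesseract 3.x, as in the command above);
    legacy_flag=False emits "--psm" (Tesseract 4 and later).
    """
    psm_flag = "-psm" if legacy_flag else "--psm"
    return ["tesseract", image_path, output_base, psm_flag, str(psm)]


def run_tesseract(image_path, output_base="output", **kwargs):
    """Run tesseract and return the text it wrote to <output_base>.txt."""
    subprocess.run(tesseract_command(image_path, output_base, **kwargs), check=True)
    with open(output_base + ".txt", encoding="utf-8") as fh:
        return fh.read().strip()
```

Calling `run_tesseract("captcha.tif")` would then mirror the shell command, reading the result back from `output.txt`.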

I always get an accurate result.

Python code:
I tried to replicate the procedure above using OpenCV and Tesseract -

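import cv2
from PIL import Image
from pytesseract import image_to_string

def binarize_image_using_opencv(captcha_path, binary_image_path='input-black-n-white.jpg'):
    # cv2.CV_LOAD_IMAGE_GRAYSCALE was removed in OpenCV 3; IMREAD_GRAYSCALE is the current name
    im_gray = cv2.imread(captcha_path, cv2.IMREAD_GRAYSCALE)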
    (thresh, im_bw) = cv2.threshold(im_gray, 128, 255, cv2.THRESH_BINARY | cv2.THRESH_OTSU)
    # although thresh is used below, gonna pick something suitable
    im_bw = cv2.threshold(im_gray, thresh, 255, cv2.THRESH_BINARY)[1]
    cv2.imwrite(binary_image_path, im_bw)

    return binary_image_path

def preprocess_image_using_opencv(captcha_path):
    bin_image_path = binarize_image_using_opencv(captcha_path)

    im_bin = Image.open(bin_image_path)
    basewidth = 300  # in pixels
    wpercent = (basewidth/float(im_bin.size[0]))
    hsize = int((float(im_bin.size[1])*float(wpercent)))
    big = im_bin.resize((basewidth, hsize), Image.NEAREST)

    # tesseract-ocr only works with TIF so save the bigger image in that format
    tif_file = "input-NEAREST.tif"
    big.save(tif_file)

    return tif_file

def get_captcha_text_from_captcha_image(captcha_path):

    # Preprocess the image before OCR
    tif_file = preprocess_image_using_opencv(captcha_path)

    #   Perform OCR using tesseract-ocr library
    # OCR : Optical Character Recognition
    image = Image.open(tif_file)
    ocr_text = image_to_string(image, config="-psm 6")
    alphanumeric_text = ''.join(e for e in ocr_text)

    return alphanumeric_text

But I don't get the same accuracy. What am I missing?

Update 1:

Original image
Tif image created using Gimp
Tif image created by my Python code

Update 2:

This code is available at https://github.com/hussaintamboli/python-image-to-text

score 1 · Accepted Answer

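If the output deviates only slightly from what you expect (for example the extra "," mentioned in your comments), try restricting character recognition to the character set you expect (e.g. alphanumeric).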
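
One way to try this with pytesseract is Tesseract's `tessedit_char_whitelist` config variable, combined with a plain Python filter as a fallback. The sketch below makes several assumptions: the captcha alphabet is uppercase letters plus digits, Tesseract 4 or later is installed (hence `--psm`), and the helper names are hypothetical. The whitelist is reportedly ignored by the LSTM engine in some Tesseract 4.0 builds, which is why the post-filter is kept.

```python
import string

# Assumed captcha alphabet: uppercase letters and digits.
ALNUM = string.ascii_uppercase + string.digits


def build_config(psm=6, whitelist=ALNUM):
    """Build a Tesseract config string that restricts the recognized characters."""
    return f"--psm {psm} -c tessedit_char_whitelist={whitelist}"


def keep_allowed(text, allowed=ALNUM):
    """Drop every character that is not in the expected alphabet."""
    return "".join(ch for ch in text if ch in allowed)


def ocr_restricted(image_path):
    """OCR an image with the whitelist applied, then filter the raw output."""
    # Imported lazily so the two helpers above work without the OCR stack installed.
    from PIL import Image
    import pytesseract

    raw = pytesseract.image_to_string(Image.open(image_path), config=build_config())
    return keep_allowed(raw)
```

With this filter, a raw reading such as `"88,BC7F\n"` would be reduced to `"88BC7F"` even if the engine does not honor the whitelist.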

score 1 · Accepted Answer

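You have already applied simple thresholding. What is missing is reading the image one character at a time.

For each single character:

1. Upsample
2. Add a border

Upsampling is needed for accurate recognition. Adding a border centers the character in the image.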



8	8	B	C	7	F

Code:

import cv2
import pytesseract

img = cv2.imread('Iv5BS.jpg')
gry = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)
thr = cv2.threshold(gry, 128, 255, cv2.THRESH_BINARY + cv2.THRESH_OTSU)[1]

(h_thr, w_thr) = thr.shape[:2]
s_idx = 2
e_idx = int(w_thr/6) - 20
result = ""

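for _ in range(0, 6):
    # Crop one character: fixed row band, current column window
    crp = thr[5:int((6*h_thr)/7), s_idx:e_idx]
    (h_crp, w_crp) = crp.shape[:2]
    # Upsample 2x, then pad with a white border so the glyph sits away from the edges
    crp = cv2.resize(crp, (w_crp*2, h_crp*2))
    crp = cv2.copyMakeBorder(crp, 10, 10, 10, 10, cv2.BORDER_CONSTANT, value=255)
    # Advance the column window to the next character
    s_idx = e_idx
    e_idx = s_idx + int(w_thr/6) - 7
    txt = pytesseract.image_to_string(crp, config="--psm 6")
    # Keep the first recognized character; slicing avoids an IndexError on empty output
    result += txt.strip()[:1]
    cv2.imshow("crp", crp)  # debug view; press any key to continue
    cv2.waitKey(0)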

print(result)

Result:

88BC7F
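
The column offsets in the loop above (`2`, `- 20`, `- 7`) are tuned by hand for this particular image. To check which columns each iteration would crop on an image of a different width, the same index arithmetic can be pulled into a small helper. This is a sketch only: `crop_bounds` is a hypothetical name and the width of 300 in the usage line is an arbitrary example, not the width of the image in the question.

```python
def crop_bounds(w_thr, count=6):
    """Return the (start, end) column pairs produced by the loop's index arithmetic."""
    bounds = []
    s_idx = 2
    e_idx = int(w_thr / 6) - 20
    for _ in range(count):
        bounds.append((s_idx, e_idx))
        # Same update as in the loop: the next window starts where this one ended
        s_idx = e_idx
        e_idx = s_idx + int(w_thr / 6) - 7
    return bounds


if __name__ == "__main__":
    # Hypothetical width, used only to inspect the windows
    for start, end in crop_bounds(300):
        print(start, end, end - start)
```

If a window comes out empty or negative for a given width, the offsets need adjusting before the crops are passed to Tesseract.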
