python - Python - Pytesseract extracts incorrect text from image

Question

I am using the following code in Python to extract text from an image:

import cv2
import numpy as np
import pytesseract
from PIL import Image

# Path of working folder on Disk
src_path = "<dir path>"

def get_string(img_path):
    # Read image with opencv
    img = cv2.imread(img_path)

    # Convert to gray
    img = cv2.cvtColor(img, cv2.COLOR_BGR2GRAY)

    # Apply dilation and erosion to remove some noise
    kernel = np.ones((1, 1), np.uint8)
    img = cv2.dilate(img, kernel, iterations=1)
    img = cv2.erode(img, kernel, iterations=1)

    # Write image after removed noise
    cv2.imwrite(src_path + "removed_noise.png", img)

    #  Apply threshold to get image with only black and white
    #img = cv2.adaptiveThreshold(img, 255, cv2.ADAPTIVE_THRESH_GAUSSIAN_C, cv2.THRESH_BINARY, 31, 2)

    # Write the image after apply opencv to do some ...
    
    cv2.imwrite(src_path + "thres.png", img)

    # Recognize text with tesseract for python
    result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

    # Remove template file
    #os.remove(temp)

    return result


print('--- Start recognize text from image ---')
print(get_string(src_path + "test.jpg"))

print("------ Done -------")

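But the output is incorrect. The input file is:

The output received is "0001" instead of "D001".

The output received is "3001" instead of "B001".

What code changes are needed to retrieve the correct characters from the image, and how can pytesseract be trained to return the correct characters for all font types in the image [including bold characters]?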

score 3 · Accepted Answer

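@Maaaaa has pointed out the exact reason why Tesseract misrecognizes the text.

But you can still improve the final result by applying some post-processing steps to the Tesseract output. Here are a few points to consider and use if they help: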

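Try disabling the dictionary check feature in the Tesseract input parameters.
Use heuristics from your dataset. From the sample images given in the question, I guess that the first character of each word/sequence is a letter, so you can replace a leading digit in the output with the most likely letter based on your dataset, e.g. '0' can be replaced with 'D', so '0001' -> 'D001', and similarly for other cases.
Tesseract also provides character-level recognition confidence values, so use that information to replace a character with the candidate that has the highest confidence value.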
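The leading-character heuristic above could be sketched roughly as follows. This is only an illustration: the names `DIGIT_TO_LETTER`, `fix_leading_char` and `fix_ocr_output` are hypothetical, and the mapping contains just the two misreads reported in the question ('0' for 'D', '3' for 'B'). A real mapping would have to come from your own dataset.

```python
# Hypothetical mapping, built only from the two misreads shown in the question.
# Extend it with the confusions you actually observe in your own data.
DIGIT_TO_LETTER = {"0": "D", "3": "B"}


def fix_leading_char(token, mapping=DIGIT_TO_LETTER):
    """Replace a leading digit with the letter it was probably misread from.

    Assumes every token is expected to start with a letter, as guessed
    from the sample images in the question.
    """
    if token and token[0] in mapping:
        return mapping[token[0]] + token[1:]
    return token


def fix_ocr_output(raw_text, mapping=DIGIT_TO_LETTER):
    """Split raw OCR output on whitespace and correct each token."""
    return [fix_leading_char(token, mapping) for token in raw_text.split()]
```

You would call `fix_ocr_output(result)` on the string returned by `get_string`. Tokens that already start with a letter are returned unchanged, so the correction is safe to apply to every token.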

score 1 · Accepted Answer

Try different configuration parameters in the following line

result = pytesseract.image_to_string(Image.open(img_path))#src_path+ "thres.png"))

as shown below:

result = pytesseract.image_to_string(Image.open(img_path), config='--psm 1 --oem 3')

Try changing the psm value and compare the results

- Good luck -
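One way to compare several psm values side by side is sketched below. The helper names `build_config` and `compare_psm` are hypothetical, and the default modes (6, 7, 8, 13) are just an assumed starting set for short single-line labels like "D001". The optional `load_system_dawg=0` / `load_freq_dawg=0` settings are Tesseract config variables that switch off the dictionary word lists. How much they help depends on the engine and version, so treat them as something to experiment with rather than a guaranteed fix.

```python
def build_config(psm, oem=3, use_dictionary=True):
    """Build a config string for pytesseract's `config` argument."""
    parts = ["--psm {}".format(psm), "--oem {}".format(oem)]
    if not use_dictionary:
        # Tesseract variables that disable the dictionary word lists.
        parts.append("-c load_system_dawg=0")
        parts.append("-c load_freq_dawg=0")
    return " ".join(parts)


def compare_psm(img_path, psm_values=(6, 7, 8, 13), use_dictionary=False):
    """Run OCR once per page segmentation mode and return {psm: text}."""
    # Imported lazily so build_config works even without OCR installed.
    import pytesseract
    from PIL import Image

    image = Image.open(img_path)
    results = {}
    for psm in psm_values:
        config = build_config(psm, use_dictionary=use_dictionary)
        text = pytesseract.image_to_string(image, config=config)
        results[psm] = text.strip()
    return results
```

Calling `compare_psm(src_path + "test.jpg")` and printing the returned dictionary lets you see which mode, if any, reads the leading letter correctly.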
