I'm learning OCR using PyTesser
and Tesseract
. As the first milestone, I want to write a tool to recognize captcha that simply consists of some digits. I read some tutorials and wrote such a test program.
from pytesser.pytesser import *
from PIL import Image, ImageFilter, ImageEnhance
im = Image.open("test.tiff")
im = im.filter(ImageFilter.MedianFilter())
enhancer = ImageEnhance.Contrast(im)
im = enhancer.enhance(2)
im = im.convert('1')
text = image_to_string(im)
print "text={}".format(text)
I tested my code with the image below. But the result is 2(T?770
. And I've tested some other similar images as well, in 80% case the results are incorrect.
I'm not familiar with imaging processing. I've two questions here:
Is it possible to tell
PyTesser
to guess digits only?I think the image is quite easy for human to read. If it is so difficult for
PyTesser
to read digits only image, is there any alternatives can do a better OCR?
Any hints are very appreciated.