python - Image to text - removing non-ASCII characters in Python 2.7

Question

I am using pytesser to OCR a small image and get a string out of it:

image = Image.open(ImagePath)
text = image_to_string(image)
print text

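However, pytesser sometimes recognizes and returns non-ASCII characters. The problem shows up when I then try to print what was just recognized: in Python 2.7 (which is what I am using), the program crashes.

Is there a way to make pytesser not return any non-ASCII characters? Perhaps there is something that can be changed in Tesseract OCR?

Alternatively, is there some way to test a string for non-ASCII characters (without crashing the program) and then simply not print that line?

Some people will suggest using Python 3.4, but from my research pytesser does not seem to work with it: Pytesser in Python 3.4: name 'image_to_string' is not defined?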

score 4 · Accepted Answer

I would go with Unidecode. This library converts non-ASCII characters to their closest ASCII representation.

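from unidecode import unidecode
image = Image.open(ImagePath)
text = image_to_string(image)
# unidecode expects a unicode object; this assumes the OCR output is UTF-8 bytes
print unidecode(text.decode('utf-8'))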

It should work perfectly!
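If installing an extra library is not an option, the second part of the question (testing a string for non-ASCII characters without crashing) can be sketched with the standard library alone. The helper names to_text, is_ascii and strip_non_ascii are made up for this example, the sample bytes only stand in for whatever image_to_string returns, and UTF-8 output is an assumption:

```python
from __future__ import print_function


def to_text(raw, encoding="utf-8"):
    """Decode OCR output to a text string; undecodable bytes are replaced."""
    if isinstance(raw, bytes):
        return raw.decode(encoding, "replace")
    return raw


def is_ascii(text):
    """Return True if every character is in the 7-bit ASCII range."""
    return all(ord(ch) < 128 for ch in text)


def strip_non_ascii(text):
    """Drop every character outside the 7-bit ASCII range."""
    return "".join(ch for ch in text if ord(ch) < 128)


# Stand-in for the OCR result (assumption: UTF-8 encoded bytes).
sample = b"caf\xc3\xa9 42"
text = to_text(sample)

if is_ascii(text):
    print(text)
else:
    # Either skip the line entirely, or print a cleaned-up version.
    print(strip_non_ascii(text))
```

Unlike Unidecode, this drops the offending characters instead of transliterating them, so "é" disappears rather than becoming "e".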

score 0 · Accepted Answer

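Is there a way to make pytesser not return any non-ASCII characters?

You can restrict the characters that tesseract is allowed to recognize with the tessedit_char_whitelist option.

For example: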

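import string
# Allow only digits, ASCII letters, and the characters _ - .
char_whitelist = string.digits
char_whitelist += string.ascii_lowercase
char_whitelist += string.ascii_uppercase
image = Image.open(ImagePath)
# Note: the config keyword is the pytesseract-style signature; this assumes
# the image_to_string in use accepts it.
text = image_to_string(image,
    config="-c tessedit_char_whitelist=%s_-." % char_whitelist)
print text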

See also: https://github.com/tesseract-ocr/tesseract/wiki/FAQ-Old#how-do-i-recognize-only-digits
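As a rough sketch of the whitelist idea, the snippet below builds the same config string and also filters the recognized text afterwards, which can act as a fallback in case the engine in use does not honor the whitelist. The names ALLOWED, build_whitelist_config and keep_allowed are hypothetical, and the sample string is invented for illustration; nothing here calls Tesseract itself:

```python
import string

# Same character set as the whitelist above: digits, ASCII letters, and _ - .
ALLOWED = string.digits + string.ascii_letters + "_-."


def build_whitelist_config(allowed=ALLOWED):
    """Build the config string to pass to the OCR call."""
    return "-c tessedit_char_whitelist=%s" % allowed


def keep_allowed(text, allowed=ALLOWED, keep_whitespace=True):
    """Post-filter: keep only whitelisted characters (and optionally whitespace)."""
    return "".join(
        ch for ch in text
        if ch in allowed or (keep_whitespace and ch.isspace())
    )


config = build_whitelist_config()
print(config)

# Invented sample standing in for OCR output that slipped past the whitelist.
print(keep_allowed(u"Total: 42.50 \u20ac"))
```

Filtering after recognition is cruder than constraining the recognizer, because a character outside the set is simply dropped rather than re-read as the nearest allowed one, but it does guarantee that nothing unexpected reaches print.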
