ocr - Reliably extracting identity fields from scanned documents / images?

Question

I have to pull two pre-printed (not hand-written) fields out of a paper form so that the form can be routed automatically after it is scanned. The fields contain batch and item identifiers, like "GG-9192" or "EPN/245G".

I've tried the following software:

Tesseract-OCR
Cuneiform
Canon ImageRunner built-in OCR
Asprise OCR Java API (demo)

I've tried the following settings:

Scanning at resolutions of 300dpi and 600dpi
Printing the fields in different fonts, including OCR-A and OCR-B

In all cases output was pretty much all over the place. I can kick back documents for which I can't properly extract the necessary information, but I'm thinking it's going to be at least half of them. I considered some sort of fuzzy logic based on known values in a database, but sometimes these identifiers can differ by a single character, like "123G" and "123C".
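To make the fuzzy-matching idea concrete, here is a minimal sketch of what it might look like. It assumes the known identifiers fit in an in-memory collection and are stored in upper case. The class and method names (`IdentifierMatcher`, `match`, `editDistance`) are made up for illustration and do not come from any OCR library. An OCR result is accepted only if it matches a known identifier exactly, or if exactly one known identifier lies within a small edit distance. Anything ambiguous is rejected so the document can be kicked back for manual handling.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.List;
import java.util.Locale;
import java.util.Optional;

public class IdentifierMatcher {

    /** Levenshtein (edit) distance between two strings, computed row by row. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j; // distance from the empty prefix of a
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; // reuse the two rows instead of allocating
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /**
     * Returns the known identifier that the OCR text most plausibly refers to,
     * or empty if there is no candidate or more than one.
     * Assumption: knownIds are stored in upper case.
     */
    public static Optional<String> match(String ocrText,
                                         Collection<String> knownIds,
                                         int maxDistance) {
        String cleaned = ocrText.trim().toUpperCase(Locale.ROOT);
        if (knownIds.contains(cleaned)) {
            return Optional.of(cleaned); // exact hit
        }
        List<String> candidates = new ArrayList<>();
        for (String id : knownIds) {
            if (editDistance(cleaned, id) <= maxDistance) {
                candidates.add(id);
            }
        }
        // Accept only an unambiguous near match; otherwise reject the scan.
        return candidates.size() == 1
                ? Optional.of(candidates.get(0))
                : Optional.empty();
    }

    public static void main(String[] args) {
        List<String> known = List.of("GG-9192", "EPN/245G", "123G", "123C");
        System.out.println(match("GG-9I92", known, 1)); // one misread character, unique candidate
        System.out.println(match("1236", known, 1));    // equally close to 123G and 123C, rejected
    }
}
```

This does not solve the "123G" versus "123C" problem: if the OCR engine confidently returns the wrong one of two valid identifiers, the exact-match branch accepts it. The sketch only guards against misreads that land on a string not in the database.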

Is this a lost cause? Perhaps OCR just isn't mature enough to handle a requirement of this nature? What other techniques might you recommend? Barcodes?

Edit: the containing application is in Java, so recommendations that have free or cheap Java-based APIs would help most.

Edit 2: if anyone is interested...without any special tuning, Cuneiform for Linux and the Canon ImageRunner worked best, with Tesseract-OCR and the Asprise Java API producing the worst results...none of the four was acceptable for anything but standard document-search-grade OCR. I'm beginning to think that this isn't going to work out.

score 2 · Accepted Answer

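If you have control over the fields, why use a human-readable format in the first place? For scanning, a QR code or something similar seems like the best choice. It has orientation markers and some built-in error correction.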

http://en.wikipedia.org/wiki/QR_Code

score 2 · Accepted Answer

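Following Tomato's suggestion, I started digging into products. I tried ABBYY and CVISION. Both have products that can automate OCR.

In addition, ABBYY has SDKs for a variety of platforms, and CVISION has an SDK that appears to work with at least VB/VC++.

I haven't tried either SDK yet, and I'm not sure my project needs one. All I need are PDFs from which text can be extracted. I did, however, try CVISION's server product with OCR on its most accurate setting, and it worked very well. I haven't tried ABBYY's server product yet, because I have to go through a reseller to get a trial. I'm working on that, but if it starts to get annoying I may just go with CVISION. I did try ABBYY's standalone FineReader product, and it worked well, so I assume their server product will too.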
