
I want to identify keywords in scanned documents that may contain OCR errors. Given a list of keywords and, for each character, its confidence value and its alternatives from the scanned document, how can I develop an algorithm that reliably identifies the keywords?

For OCR I am using Tesseract, which provides a confidence value for each character and its best alternative choices. So for every word I have a list like this:

 Word=order
 [0] o (93%) [alts: 0 (90%), c (83%), e (82%)]
 [1] r (96%)
 [2] d (96%)
 [3] e (90%) [alts: a (75%)]
 [4] r (95%) 

Another example that includes OCR errors:

 Word=PaYmeHI (Payment would be correct)
 [0] P (81%) [alts: p (78%), D (68%)]
 [1] a (76%) [alts: 3 (73%), e (63%), ö (61%)]
 [2] Y (87%) [alts: V (86%)]
 [3] m (83%) 
 [4] E (71%) [alts: € (79%), 9 (72%), % (67%), e (65%), B (64%), G (64%)]
 [5] H (76%) [alts: n (83%), D (74%), N (70%), ü (69%)]
 [6] I (75%) [alts: t (69%), { (67%), 1 (65%), i (61%)]

As you can see, Tesseract does not always pick the choice with the highest percentage as its result (positions 4 and 5).

From skimming the results, most characters with values above 90% are correct. However, bad results do not necessarily contain the correct character in the list of alternatives (see [2], which should be a lowercase y).

Currently I am building a list of candidates using the Levenshtein distance and the string length. In addition, I exclude keywords where lev2 > 3. This value is simply hard-coded, as I am still looking for a good way to determine the threshold.

      int lev = getLevenshteinDistance(keyword, s);                              // case-sensitive distance
      int lev2 = getLevenshteinDistance(keyword.toLowerCase(), s.toLowerCase()); // case-insensitive distance
      int len = Math.abs(keyword.length() - s.length());                         // difference in length
      int x = lev + lev2 + len;                                                  // combined score, lower is better

I sort the keyword list by x to get the most likely results.

So first, I am looking for a way to determine a good threshold based on the OCR result and the string length. Short strings require a lower threshold and more reliable OCR results than long ones. Taking the examples above: for the word order, lev2 <= 1 would be sufficient, while for payment at least lev2 <= 3 should count.
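Purely as an illustration of such a length-dependent rule (not a validated threshold), a minimal sketch in the style of the snippet above; the factor 0.3 is an assumption that would have to be tuned on real data:

    // Hypothetical helper: derive a distance threshold from the keyword length.
    // The factor 0.3 is an assumption; with truncation it allows 1 edit for
    // "order" (5 chars) and 2 edits for "payment" (7 chars).
    static int distanceThreshold(String keyword) {
        double factor = 0.3;                                // assumed: tune on real data
        return Math.max(1, (int) (keyword.length() * factor));
    }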

Second, how can I determine whether one of the remaining candidates really matches the word? If lev == 0 and the confidence values of all characters are >= 90, it is obvious. But given bad OCR results, what algorithm can I develop that also takes the alternative OCR choices into account?


2 Answers


I have been thinking about something similar for a project of mine; I haven't got any good answers yet, but here are some thoughts:

I think the question we're trying to answer is this:

Does this document (the OCR result) contain the term 'order'?

Idea 1

The OCR document contains terms with some 'score' ...

So in your example, the document contains:

  • order with score = sum(93,96,96,90,95)/5 = 94
  • 0rder with score = sum(90,96,96,90,95)/5 = 93
  • crder with score = sum(83,96,96,90,95)/5 = 92
  • erder with score = sum(82,96,96,90,95)/5 = 91
  • ordar with score = sum(93,96,96,75,95)/5 = 91
  • 0rdar with score = sum(90,96,96,75,95)/5 = 90
  • crdar with score = sum(83,96,96,75,95)/5 = 89
  • erdar with score = sum(82,96,96,75,95)/5 = 88
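For illustration, a minimal sketch of how these candidate terms and their scores might be enumerated from the per-character alternatives; the CharChoice type is a hypothetical stand-in for whatever structure holds Tesseract's choices, not part of its API:

    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Map;

    // Hypothetical per-character choice: a possible character plus its confidence in percent.
    record CharChoice(char c, int confidence) {}

    class CandidateTerms {
        // Expand every combination of per-character choices into a candidate term,
        // scored with the average confidence of the chosen characters.
        static Map<String, Integer> expand(List<List<CharChoice>> word) {
            Map<String, Integer> scores = new LinkedHashMap<>();
            expand(word, 0, new StringBuilder(), 0, scores);
            return scores;
        }

        private static void expand(List<List<CharChoice>> word, int pos,
                                   StringBuilder prefix, int confSum,
                                   Map<String, Integer> out) {
            if (pos == word.size()) {
                out.put(prefix.toString(), confSum / word.size()); // average confidence
                return;
            }
            for (CharChoice choice : word.get(pos)) {
                prefix.append(choice.c());
                expand(word, pos + 1, prefix, confSum + choice.confidence(), out);
                prefix.deleteCharAt(prefix.length() - 1);          // backtrack
            }
        }
    }

With the order example above and integer division, this reproduces the eight scores listed (93.4 truncates to 93, 91.8 to 91, and so on). Note that the number of combinations grows quickly with the number of alternatives per character.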

Now that we have a score for each candidate, we can get a score for the document, given some query (using Levenshtein distance for now ...)

The score for the document, given the keyword "order", is the average of

  • ((3 - min(lev(order, order), 3)) * 0.33) * 94,
  • ((3 - min(lev(0rder, order), 3)) * 0.33) * 93,
  • ((3 - min(lev(crder, order), 3)) * 0.33) * 92,
  • ...,
  • ...

If this score is higher than some threshold, the document is deemed to match 'order'.
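A sketch of that document score, assuming getLevenshteinDistance is available as in the question's snippet (e.g. from Apache Commons Lang) and reading the 0.33 factor as 1/3:

    import java.util.Map;

    // Score a document for one keyword: average, over all candidate terms,
    // of (how close the candidate is to the keyword) * (the candidate's OCR score).
    static double documentScore(Map<String, Integer> candidateScores, String keyword) {
        double sum = 0;
        for (Map.Entry<String, Integer> e : candidateScores.entrySet()) {
            int lev = getLevenshteinDistance(e.getKey(), keyword);
            double proximity = (3 - Math.min(lev, 3)) / 3.0;  // 1.0 for an exact match, 0 for lev >= 3
            sum += proximity * e.getValue();
        }
        return sum / candidateScores.size();                  // compare against a chosen threshold
    }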

Idea 2

We can improve the OCR results with some language models

Compute score for each term as follows:

term   | ocr_score | ngram score           | combined score
-------+-----------+-----------------------+---------------
order  | 94        | score(ord, rde, der)  | ocr * ngram
0rder  | 93        | score(0rd, rde, der)  | ocr * ngram
crder  | 92        | score(crd, rde, der)  | ocr * ngram
erder  | 91        | score(erd, rde, der)  | ...
ordar  | 91        | score(ord, rda, dar)  | ...
0rdar  | 90        | score(0rd, rda, dar)  | ...
crdar  | 89        | score(crd, rda, dar)  | ...
erdar  | 88        | score(erd, rda, dar)  | ...

Where score(ord) = trigram probability of 'ord'

Google Books, for example, gives the probability of any trigram (see: http://books.google.com/ngrams/chart?content=ord&corpus=0&smoothing=3&year_start=1970&year_end=2000)

We could also compute unigrams, bigrams, quadgrams, ...; we could then compute scores based on the "unigram" probability of the words themselves, on bigrams of words, and so on; we could also apply some purely analytic language models.

So we now have several scores for each 'candidate term', and we combine them all, with some weight for each score, to get a combined score for the term.
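A sketch of the combined score for a single candidate term; trigramProbability is a hypothetical lookup standing in for whatever n-gram model (e.g. one built from Google Books counts) is actually used:

    import java.util.Map;

    // Combine the OCR score of a candidate term with a character-trigram score.
    // trigramProbability is a stand-in for the chosen language model.
    static double combinedScore(String term, double ocrScore,
                                Map<String, Double> trigramProbability) {
        double ngramScore = 1.0;
        for (int i = 0; i + 3 <= term.length(); i++) {
            String trigram = term.substring(i, i + 3);
            // Unseen trigrams get an assumed small floor probability instead of zero.
            ngramScore *= trigramProbability.getOrDefault(trigram, 1e-6);
        }
        return ocrScore * ngramScore;   // the "ocr * ngram" column from the table above
    }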

Idea 3

Ok, so the above leads to an explosion of terms / scores ... which is compute-intensive; so we use some magic to build a probabilistic DFA for each term based on Ideas 1 & 2. The document now contains probabilistic DFAs rather than terms. The Lucene guys have done some work to build Levenshtein DFAs and allow checking whether DFA1 and DFA2 match quickly ...

Answered 2012-05-02T20:51:18.523

First, I think your program gives you P(observation|symbol), not P(symbol|observation). P(symbol|observation) is proportional to P(observation|symbol) * P(symbol).

For example, for that e in payment: although P(observation|symbol) is highest for the euro sign, the probability of a euro sign occurring at all is very small. Therefore it is most likely an 'e', not a euro sign.

Therefore, my suggestion is to compute the sum of log( P(observation|symbol) * P(symbol) ) for every possible word and pick the word that maximizes this value.
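For illustration only, a sketch of this suggestion under two assumptions: observation and candidate word have the same length, and a small floor probability is used when a symbol does not appear among the OCR choices at a position. symbolPrior is a hypothetical P(symbol) table (e.g. letter frequencies):

    import java.util.List;
    import java.util.Map;

    class BayesianWordScore {
        // observed: one map per character position, mapping each OCR choice
        // (best result and alternatives) to its confidence in percent.
        static String mostLikelyWord(List<Map<Character, Integer>> observed,
                                     List<String> candidateWords,
                                     Map<Character, Double> symbolPrior) {
            String best = null;
            double bestLogProb = Double.NEGATIVE_INFINITY;
            for (String word : candidateWords) {
                if (word.length() != observed.size()) continue;   // simplification: equal length only
                double logProb = 0;
                for (int i = 0; i < word.length(); i++) {
                    char symbol = word.charAt(i);
                    // Assumed floor of 1% when the symbol was not among the OCR choices.
                    double pObsGivenSym = observed.get(i).getOrDefault(symbol, 1) / 100.0;
                    double pSym = symbolPrior.getOrDefault(symbol, 1e-4);
                    logProb += Math.log(pObsGivenSym * pSym);
                }
                if (logProb > bestLogProb) {
                    bestLogProb = logProb;
                    best = word;
                }
            }
            return best;
        }
    }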

In addition, instead of using P(symbol), you can use a more accurate estimate by taking the context into account.

Answered 2012-05-02T18:59:17.030