javascript - Convert OCRed unstructured text to proper text

Question

I am using Microsoft MODI from VB6 to OCR an image. (I know about other OCR tools such as tesseract, but I find MODI more accurate than the others.)

The image to be OCRed looks like this:

[image]

The text I get after OCR looks like this:

Text1
Text2
Text3
Number1
Number2
Number3

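The problem is that the correspondence between the text in the two columns is not preserved. How can I map Number1 to Text1?

The only solution I can think of is the following.

MODI provides the coordinates of every OCRed word, like this: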

LeftPos = Img.Layout.Words(0).Rects(0).Left
TopPos = Img.Layout.Words(0).Rects(0).Top

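So, to align the words that belong to the same line, we can match each word's TopPos and then sort the matches by LeftPos, which gives us the complete line. I looped over all the words and stored their text together with left and top in a MySQL table, then ran this query: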

SELECT group_concat(word ORDER BY `left` SEPARATOR ' ')
FROM test_copy
GROUP BY `top`

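My problem is that the top position of each word is not exactly the same; there will obviously be a difference of a few pixels.

I tried adding DIV 5 to merge words that fall within a 5-pixel range, but that does not work in some cases. I also tried doing it in node.js by computing a tolerance for each word and then sorting by LeftPos, but I still feel this is not the best approach.

Update: the js code does the job, except for the case where Number1 is off by 5 pixels and Text2 has no counterpart on its line.

Is there a better way to do this?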
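A minimal sketch of why fixed bucketing such as DIV 5 can split a line: two tops that differ by only 2 pixels land in different buckets when they straddle a multiple of 5. The word records, field names, and both helper functions below are assumptions made for illustration, not part of MODI or of the original code. The second helper sorts by top and starts a new line only when the gap to the previous word exceeds the tolerance, which is one possible alternative, not necessarily the best one.

```javascript
// Hypothetical word records; field names are assumptions for illustration.
const words = [
  { text: "Text1", left: 10, top: 14 },
  { text: "Number1", left: 200, top: 16 },
  { text: "Text2", left: 10, top: 40 },
  { text: "Number2", left: 200, top: 43 },
];

// Fixed bucketing, equivalent in spirit to GROUP BY (top DIV size).
function bucketByDiv(words, size) {
  const buckets = new Map();
  for (const w of words) {
    const key = Math.floor(w.top / size);
    if (!buckets.has(key)) buckets.set(key, []);
    buckets.get(key).push(w);
  }
  return [...buckets.values()];
}

// Gap-based grouping: sort by top, then start a new line whenever the
// distance to the previous word's top exceeds the tolerance.
function groupByGap(words, tolerance) {
  const sorted = [...words].sort((a, b) => a.top - b.top);
  const lines = [];
  for (const w of sorted) {
    const current = lines[lines.length - 1];
    if (current && w.top - current[current.length - 1].top <= tolerance) {
      current.push(w);
    } else {
      lines.push([w]);
    }
  }
  // Within each line, order the words left to right and join their text.
  return lines.map((line) =>
    line.sort((a, b) => a.left - b.left).map((w) => w.text).join(" ")
  );
}

console.log(bucketByDiv(words, 5).length); // tops 14 and 16 fall into different buckets
console.log(groupByGap(words, 5));
```

Because each word is compared with the previous one, a long run of slightly drifting tops can still chain into one line, so this remains a heuristic.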

score 4 · Accepted Answer

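I'm not 100% sure how you identify the words in the "left" column, but once you have identified such a word, you can find the other words on its line by projecting not just the top coordinate but the whole rectangle (top and bottom) across the page. Determine the overlap (intersection) with the other words. Note the area marked in red below.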

[image: horizontal projection]

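This is the tolerance you can use to detect whether something is on the same line. If something overlaps by only one pixel, it probably belongs to the line below or above. But if it overlaps 50% or more of the height of Text1, it is very likely on the same line.

Example SQL to find all words on a "line" based on the top and bottom coordinates: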
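The 50% rule can be sketched as a small helper. This is only an illustration: the function names, the `top`/`bottom` field names, and the default threshold of 0.5 are assumptions, and the ratio is measured against the anchor word's height as described above.

```javascript
// Fraction of the anchor's height covered by the other word's vertical
// extent. Coordinates grow downward, so top < bottom.
function verticalOverlapRatio(anchor, other) {
  const overlap =
    Math.min(anchor.bottom, other.bottom) - Math.max(anchor.top, other.top);
  const height = anchor.bottom - anchor.top;
  if (height <= 0) return 0; // degenerate rectangle, treat as no overlap
  return Math.max(0, overlap) / height;
}

// True when the other word covers at least minRatio of the anchor's height.
function overlapsWithinTolerance(anchor, other, minRatio = 0.5) {
  return verticalOverlapRatio(anchor, other) >= minRatio;
}

const text1 = { top: 10, bottom: 30 };
console.log(overlapsWithinTolerance(text1, { top: 15, bottom: 35 })); // 15 of 20 px overlap
console.log(overlapsWithinTolerance(text1, { top: 29, bottom: 50 })); // 1 of 20 px overlap
```

Measuring against the anchor's height keeps the test asymmetric on purpose: the anchor defines the line, and other words are judged relative to it.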

select 
    word.id, word.Top, word.Left, word.Right, word.Bottom 
from 
    word
where 
    -- two vertical ranges overlap when each one starts before the other ends;
    -- this also catches a taller word that completely encloses the anchor
    word.Top <= @leftColWordBottom
    and word.Bottom >= @leftColWordTop

Example pseudo-VB6 code for computing the lines:

'assume words is a collection of WordInfo objects with an Id, Top, 
'   Left, Bottom, Right properties filled in, and a LineAnchorWordId 
'   property that has not been set yet.

'get the words in left-to-right order
wordsLeftToRight = SortLeftToRight(words) 

'also get the words in top-to-bottom order
wordsTopToBottom = SortTopToBottom(words) 

'pass through identifying a line "anchor", that being the left-most 
'   word that starts (and defines) a line
for each anchorWord in wordsLeftToRight

    'check if the word has been mapped to a line yet by checking if 
    '   its anchor property has been set yet.  This assumes 0 is not 
    '   a valid id, use -1 instead if needed
    if anchorWord.LineAnchorWordId = 0 then 

        'now locate every word on this line, as bounded by the 
        '   anchorWord.  every word determined to be on this line 
        '   gets its LineAnchorWordId property set to the Id of the 
        '   anchorWord
        for each lineWord in wordsTopToBottom

            If lineWord.Bottom < anchorWord.Top Then

                'skip it, it is above the line (but keep searching down
                '   because we haven't reached the anchorWord location yet)

            ElseIf lineWord.Top > anchorWord.Bottom Then

                'skip it, it is below the line, and exit the search 
                '   early since all the rest will also be below the line
                Exit For

            ElseIf OverlapsWithinTolerance(anchorWord, lineWord) Then

                lineWord.LineAnchorWordId = anchorWord.Id

            End If

        next

    end if

next anchorWord

'at this point, every word has been assigned a LineAnchorWordId, 
'   and every word on the same line will have a matching LineAnchorWordId
'   value.  If stored in a DB you can now group them by LineAnchorWordId 
' and sort them by their Left coord to get your output.
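Since the question mentions node.js, here is a rough JavaScript sketch of the same anchor-based pass, including the final grouping step. The record shape (`text`, `left`, `top`, `bottom`), the function names, and the sample coordinates are all assumptions for illustration. One deliberate difference from the pseudocode: a word that already belongs to a line is not reassigned to a later anchor.

```javascript
// True when `other` covers at least minRatio of the anchor's height.
function overlapsWithinTolerance(anchor, other, minRatio = 0.5) {
  const overlap =
    Math.min(anchor.bottom, other.bottom) - Math.max(anchor.top, other.top);
  const height = anchor.bottom - anchor.top;
  return height > 0 && Math.max(0, overlap) / height >= minRatio;
}

function groupIntoLines(words, minRatio = 0.5) {
  const leftToRight = [...words].sort((a, b) => a.left - b.left);
  const topToBottom = [...words].sort((a, b) => a.top - b.top);
  const anchorOf = new Map(); // word -> the anchor word that defines its line

  for (const anchor of leftToRight) {
    if (anchorOf.has(anchor)) continue; // already part of an earlier line
    anchorOf.set(anchor, anchor);
    for (const w of topToBottom) {
      if (w.bottom < anchor.top) continue; // above the line, keep searching
      if (w.top > anchor.bottom) break; // below the line, and so is the rest
      if (!anchorOf.has(w) && overlapsWithinTolerance(anchor, w, minRatio)) {
        anchorOf.set(w, anchor);
      }
    }
  }

  // Group words by anchor, order lines top to bottom and words left to right.
  const lines = new Map();
  for (const w of words) {
    const anchor = anchorOf.get(w);
    if (!lines.has(anchor)) lines.set(anchor, []);
    lines.get(anchor).push(w);
  }
  return [...lines.entries()]
    .sort(([a], [b]) => a.top - b.top)
    .map(([, ws]) =>
      ws.sort((x, y) => x.left - y.left).map((w) => w.text).join(" ")
    );
}

// Hypothetical sample: Number1 sits 5 px lower than Text1, Text2 has no partner.
const sampleWords = [
  { text: "Text1", left: 10, top: 14, bottom: 30 },
  { text: "Text2", left: 10, top: 40, bottom: 56 },
  { text: "Text3", left: 10, top: 66, bottom: 82 },
  { text: "Number1", left: 200, top: 19, bottom: 35 },
  { text: "Number3", left: 200, top: 68, bottom: 84 },
];
console.log(groupIntoLines(sampleWords));
```

With this sample, Number1 joins Text1's line despite the 5-pixel offset, and Text2 stays on a line of its own.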
