python - Converting hOCR to an HTML table

Question

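I am looking for a tool, or an idea I could implement in Python, that converts an hOCR file (generated by tesseract in my application) into an HTML table. The idea is to use the text position information in the hOCR file (given as bbox values in the title attributes) to build a table based on those positions. Here is an example to illustrate the idea:

I used this image from SlideShare.net as input to my application, which uses tesseract, and got the hOCR/XML file below as output.

hOCR file: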

  <div class='ocr_page' id='page_2' title='image "sample_slide.jpg"; bbox 0 0 638 479; ppageno 1'>
   <div class='ocr_carea' id='block_1_1' title="bbox 0 0 638 479">
    <p class='ocr_par' dir='ltr' id='par_1' title="bbox 31 104 620 439">
     <span class='ocr_line' id='line_1' title="bbox 32 104 613 138"><span class='ocrx_word' id='word_1' title="bbox 32 105 119 131">done:</span> <span class='ocrx_word' id='word_2' title="bbox 132 104 262 138">working</span> <span class='ocrx_word' id='word_3' title="bbox 273 105 405 138">product,</span> <span class='ocrx_word' id='word_4' title="bbox 419 104 517 132">hotels</span> <span class='ocrx_word' id='word_5' title="bbox 528 104 613 132">listed</span> 
     </span>
     <span class='ocr_line' id='line_2' title="bbox 31 160 471 194"><span class='ocrx_word' id='word_6' title="bbox 31 164 62 187">to</span> <span class='ocrx_word' id='word_7' title="bbox 75 161 122 187">do:</span> <span class='ocrx_word' id='word_8' title="bbox 134 164 227 187">smart</span> <span class='ocrx_word' id='word_9' title="bbox 236 160 330 187">trafﬁc</span> <span class='ocrx_word' id='word_10' title="bbox 342 160 471 194">building</span> 
     </span>
     <span class='ocr_line' id='line_3' title="bbox 32 243 284 280"><span class='ocrx_word' id='word_11' title="bbox 32 243 128 280">seed</span> <span class='ocrx_word' id='word_12' title="bbox 148 243 284 280">round:</span> 
     </span>
     <span class='ocr_line' id='line_4' title="bbox 71 316 619 361"><span class='ocrx_word' id='word_13' title="bbox 71 321 156 356">CEO</span> <span class='ocrx_word' id='word_14' title="bbox 171 319 240 355">will</span> <span class='ocrx_word' id='word_15' title="bbox 260 321 384 356">invest</span> <span class='ocrx_word' id='word_16' title="bbox 517 316 619 361">$30k</span> 
     </span>
     <span class='ocr_line' id='line_5' title="bbox 75 392 620 439"><span class='ocrx_word' id='word_17' title="bbox 75 397 252 433">investor</span> <span class='ocrx_word' id='word_18' title="bbox 489 392 620 439">$120k</span> 
     </span>
    </p>
   </div>
  </div>

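What I need is to convert the hOCR file into an HTML table based on the text positions. The expected table should look something like this table.

The sizes and positions of the table cells reflect the information provided in the hOCR file.

Image source: slideshare.net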
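A minimal sketch of the idea, using only the Python standard library, might look like the following. It is not an existing tool: the names `HocrWords`, `group_cells`, `column_starts` and `hocr_to_table`, and the pixel thresholds `max_gap=40` and `tol=50`, are assumptions chosen to fit the sample above. Words on one `ocr_line` are merged into a cell unless the horizontal gap between them is large, and cell left edges are clustered into columns. Cell widths and heights are not reproduced.

```python
import re
from html import escape
from html.parser import HTMLParser

BBOX_RE = re.compile(r"bbox (\d+) (\d+) (\d+) (\d+)")


class HocrWords(HTMLParser):
    """Collect (text, x0, y0, x1, y1) for each ocrx_word, grouped by ocr_line."""

    def __init__(self):
        super().__init__()
        self.lines = []    # one list of word tuples per ocr_line
        self._bbox = None  # bbox of the word currently being read
        self._text = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if "ocr_line" in classes:
            self.lines.append([])
        elif "ocrx_word" in classes:
            match = BBOX_RE.search(attrs.get("title") or "")
            if match and self.lines:
                self._bbox = tuple(int(v) for v in match.groups())
                self._text = []

    def handle_data(self, data):
        if self._bbox is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        # Assumption: a word span contains no nested <span> elements.
        if tag == "span" and self._bbox is not None:
            self.lines[-1].append(("".join(self._text).strip(), *self._bbox))
            self._bbox = None


def group_cells(words, max_gap=40):
    """Merge neighbouring words; a gap wider than max_gap starts a new cell."""
    cells = []  # (text, x0, x1)
    for text, x0, _y0, x1, _y1 in sorted(words, key=lambda w: w[1]):
        if cells and x0 - cells[-1][2] <= max_gap:
            prev_text, prev_x0, prev_x1 = cells[-1]
            cells[-1] = (prev_text + " " + text, prev_x0, max(prev_x1, x1))
        else:
            cells.append((text, x0, x1))
    return cells


def column_starts(rows, tol=50):
    """Cluster left edges; edges within tol of a cluster's first edge share a column."""
    starts = []
    for x0 in sorted(cell[1] for row in rows for cell in row):
        if not starts or x0 - starts[-1] > tol:
            starts.append(x0)
    return starts


def hocr_to_table(hocr):
    """Return an HTML table with one row per hOCR line."""
    parser = HocrWords()
    parser.feed(hocr)
    rows = [group_cells(line) for line in parser.lines if line]
    starts = column_starts(rows)
    out = ["<table>"]
    for row in rows:
        tds = [""] * len(starts)
        for text, x0, _x1 in row:
            # Put the cell in the right-most column that starts at or before it.
            col = max(i for i, s in enumerate(starts) if s <= x0)
            tds[col] = (tds[col] + " " + text).strip()
        out.append("  <tr>" + "".join(f"<td>{escape(t)}</td>" for t in tds) + "</tr>")
    out.append("</table>")
    return "\n".join(out)
```

Calling `hocr_to_table(open("sample.hocr", encoding="utf-8").read())` (the file name is hypothetical) returns the table markup as a string. The thresholds would probably need tuning for other images, for example by scaling them with the page width from the `ocr_page` bbox.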

score 2 · Accepted Answer

Check this document. I believe it describes most (or all) of what you need. From the introduction:

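This document describes a representation of various aspects of OCR output in an XML-like format. That is, we define a set of tags containing text and other tags, together with attributes of those tags. However, since the content we are representing is formatted text, we are not actually using a new XML format for the representation; instead, we embed the representation in XHTML (or HTML), because XHTML and XHTML processing already define many aspects of OCR output representation that would otherwise need additional, separate, and ad hoc definitions.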
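In the sample from the question, those attributes are packed into each element's `title`, for example `image "sample_slide.jpg"; bbox 0 0 638 479; ppageno 1`. A small sketch of splitting such a string into named properties could look like this. The helper name `parse_title` is hypothetical, and the code assumes that properties are separated by semicolons and that quoted values contain no semicolons or backslashes.

```python
import shlex


def parse_title(title):
    """Split an hOCR title attribute into {property_name: [values]}."""
    props = {}
    for part in title.split(";"):
        # shlex.split honours the double quotes around values such as file names.
        tokens = shlex.split(part)
        if tokens:
            props[tokens[0]] = tokens[1:]
    return props


props = parse_title('image "sample_slide.jpg"; bbox 0 0 638 479; ppageno 1')
x0, y0, x1, y1 = (int(v) for v in props["bbox"])
```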

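You can also use XSLT to transform the XML into HTML. In fact, there is a project that plans to do this.

In addition, this project (hocr-tools) may help.

Finally, note that the Tesseract FAQ mentions this: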

With the configfile 'hocr', tesseract will produce XHTML output compliant with the hOCR specification.
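From Python, that config name is passed as the last command-line argument. The sketch below assumes the `tesseract` executable is on the PATH; the function names are hypothetical, and the extension of the file that tesseract writes depends on its version.

```python
import shutil
import subprocess


def tesseract_hocr_command(image_path, output_base):
    """Build the command line; 'hocr' is the config file name from the FAQ quote."""
    return ["tesseract", image_path, output_base, "hocr"]


def run_tesseract_hocr(image_path, output_base):
    """Run tesseract and let it write the hOCR output next to output_base."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract executable not found on PATH")
    subprocess.run(tesseract_hocr_command(image_path, output_base), check=True)
```

For the question's image, this would be called as `run_tesseract_hocr("sample_slide.jpg", "sample_slide")`.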

score 0 · Accepted Answer

Here is an idea for converting an hocr file into a table using some existing tools (probably too late for the original question, though):

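1. Use the hocr file together with the image file and create a PDF with hocr-pdf from the hocr-tools repository, see https://github.com/tmbdev/hocr-tools#hocr-pdf
2. Use tabula (https://github.com/tabulapdf/tabula) to extract the table data from the PDF
3. Convert the CSV data into an HTML table (there should be plenty of tools for this task)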

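The first step is only needed because tabula works only on PDFs. The second step is, IMO, the main challenge: extracting table data from visual information. If you want some ideas about algorithmic approaches, it may also be worth looking at the details of how tabula does it.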
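The last step needs no extra tooling in Python. A minimal sketch with the standard `csv` module follows; the function name `csv_to_html_table` and its `header` flag are hypothetical, and the input is assumed to be CSV text such as the table data extracted in step 2.

```python
import csv
import io
from html import escape


def csv_to_html_table(csv_text, header=False):
    """Render CSV text as an HTML table; optionally treat the first row as headers."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    out = ["<table>"]
    for i, row in enumerate(rows):
        tag = "th" if header and i == 0 else "td"
        cells = "".join(f"<{tag}>{escape(cell)}</{tag}>" for cell in row)
        out.append(f"  <tr>{cells}</tr>")
    out.append("</table>")
    return "\n".join(out)


table_html = csv_to_html_table("CEO will invest,$30k\ninvestor,$120k\n")
```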
