python - What are some strategies for converting hOCR output to a string (for use with regular expressions)?

Question

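I am using Pytesseract and want to convert its hOCR output to a string. Of course, this functionality is already implemented in Pytesseract, but I would like to learn more about possible strategies for doing it myself. Thanks.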

from pytesseract import image_to_pdf_or_hocr
hocr_output = image_to_pdf_or_hocr(image, extension='hocr')

score 0 · Accepted Answer

Since the hOCR that Tesseract produces is XHTML, which is a form of XML, we can use an XML parser.

But first we need to decode Tesseract's binary output (bytes) into a str:

from pytesseract import image_to_pdf_or_hocr

# `image` can be a PIL image, a NumPy array, or a path to an image file
hocr_output = image_to_pdf_or_hocr(image, extension='hocr')  # returns bytes
hocr = hocr_output.decode('utf-8')

Now we can parse it with xml.etree:

import xml.etree.ElementTree as ET

root = ET.fromstring(hocr)

xml.etree gives us a text iterator, itertext(), whose results we can join into a single string:

text = ''.join(root.itertext())
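
Note that joining every text node also picks up the document title and the indentation whitespace between tags, which can get in the way of a regular expression. One possible refinement, shown here only as a sketch, is to collect just the elements whose class is ocrx_word and join them with single spaces. The SAMPLE_HOCR string, the hocr_to_text helper, and the "Total:" pattern are all hypothetical and stand in for real Tesseract output and a real search pattern; the sketch assumes each word sits in an element with class="ocrx_word".

```python
import re
import xml.etree.ElementTree as ET

# Hypothetical, hand-written hOCR fragment standing in for real Tesseract output.
SAMPLE_HOCR = """<html xmlns="http://www.w3.org/1999/xhtml">
 <head><title>sample</title></head>
 <body>
  <div class='ocr_page' id='page_1'>
   <span class='ocr_line' id='line_1'>
    <span class='ocrx_word' id='word_1'>Invoice</span>
    <span class='ocrx_word' id='word_2'>No.</span>
    <span class='ocrx_word' id='word_3'>12345</span>
   </span>
   <span class='ocr_line' id='line_2'>
    <span class='ocrx_word' id='word_4'>Total:</span>
    <span class='ocrx_word' id='word_5'>99.50</span>
   </span>
  </div>
 </body>
</html>"""


def hocr_to_text(hocr: str) -> str:
    """Return the recognized words of an hOCR document, separated by single spaces."""
    root = ET.fromstring(hocr)
    words = []
    for element in root.iter():
        # Attributes are not namespaced, so the class lookup works even though
        # the tag names carry the XHTML namespace.
        if element.get("class") == "ocrx_word":
            # A word may contain nested markup (e.g. <strong>), so gather all of its text.
            word = "".join(element.itertext()).strip()
            if word:
                words.append(word)
    return " ".join(words)


text = hocr_to_text(SAMPLE_HOCR)

# With predictable whitespace, a regular expression becomes straightforward.
match = re.search(r"Total:\s*(\d+\.\d{2})", text)
if match:
    print(match.group(1))
```

Filtering on the class attribute keeps the title and layout whitespace out of the result. If line breaks matter for the pattern, the same loop could group words by their enclosing line element instead of flattening everything into one string.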
