pdf - How do I get the hidden text layout that tesseract creates for a PDF file?

Translated from: https://stackoverflow.com/questions/35841255 2016-03-07T10:24:34.283

I don't have much experience with OCR. Here is what I tried:

tesseract -l eng -psm 1 image_str007_0001.jpg image_str007_tess pdf

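The result is a perfectly structured hidden text layer: when I search the PDF, the words are found at their exact positions. My question: can I get this layout as a file (hOCR or HTML)? (A config parameter is preferred over the API.)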

What I tried:
tesseract -l eng -psm 1 image_str007_0001.jpg output hocr

and

hocr2pdf -i image_str007_001 -o output.pdf < output.hocr

In output.pdf, the words are badly misplaced when I search the text. Is the second command the wrong way to create tesseract's hOCR layout file, or is hocr2pdf unable to create the PDF correctly?
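One way to narrow this down might be to read the word boxes straight out of the hOCR file and compare them with the image, which would show whether the positions are already wrong before hocr2pdf runs. The following is only a minimal sketch: it assumes the markup tesseract usually writes (`ocrx_word` spans whose `title` attribute starts with `bbox x0 y0 x1 y1`, with `class` appearing before `title`), and `hocr_words` and `SAMPLE` are hypothetical names invented for illustration.

```python
import re
from html import unescape

# Assumed hOCR word markup, as tesseract typically emits it:
# <span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span>
WORD_RE = re.compile(
    r"<span[^>]*class=['\"]ocrx_word['\"][^>]*"
    r"title=['\"][^'\"]*?bbox (\d+) (\d+) (\d+) (\d+)[^'\"]*['\"][^>]*>(.*?)</span>",
    re.DOTALL,
)


def hocr_words(hocr_text):
    """Return (text, (x0, y0, x1, y1)) for every ocrx_word span found."""
    words = []
    for x0, y0, x1, y1, inner in WORD_RE.findall(hocr_text):
        # Drop nested tags such as <strong> or <em>, then decode entities.
        text = unescape(re.sub(r"<[^>]+>", "", inner)).strip()
        words.append((text, (int(x0), int(y0), int(x1), int(y1))))
    return words


# Hypothetical two-word fragment standing in for a real output.hocr.
SAMPLE = (
    "<span class='ocr_line' id='line_1_1' title='bbox 36 92 580 122'>"
    "<span class='ocrx_word' id='word_1_1' title='bbox 36 92 96 116; x_wconf 95'>Hello</span> "
    "<span class='ocrx_word' id='word_1_2' title='bbox 109 92 185 116; x_wconf 93'>world</span>"
    "</span>"
)

if __name__ == "__main__":
    for text, box in hocr_words(SAMPLE):
        print(text, box)
```

To try it on the real file, something like `hocr_words(open("output.hocr", encoding="utf-8").read())` should work. hOCR boxes are in image pixels with the origin at the top-left corner, so if they line up with the words in image_str007_0001.jpg, the hOCR file is probably fine and the misplacement happens in the PDF step.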

0 answers