xml - How to embed external OCR into existing PDF?

Question

I have a set of images over which I run an OCR application. This process results in an XML file with character offsets. I then convert the images to PDF using Acrobat 9. Now I would like to add the information from the XML file to the PDF as an invisible text layer, so that the PDF becomes searchable. Is there an easy and free way to do this?

Some details:

I don't want to use Acrobat's OCR functionality;
The OCR process results in an XML file which contains elements like:

<line baseline="1049" l="158" t="1012" r="1196" b="1060">This is a sample line of text from an image</line>
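As a rough illustration of the first step any embedding approach would need, the sketch below parses one such element and maps its offsets to PDF coordinates. It assumes (the question does not say) that `l`, `t`, `r`, `b` and `baseline` are pixel offsets measured from the top-left corner of the image, and that the scan resolution is known; the function name `line_to_pdf_box`, the 300 dpi default and the page height of 3300 pixels are all hypothetical.

```python
import xml.etree.ElementTree as ET

# Sample element copied from the question.
SAMPLE = (
    '<line baseline="1049" l="158" t="1012" r="1196" b="1060">'
    "This is a sample line of text from an image</line>"
)


def line_to_pdf_box(elem, page_height_px, dpi=300):
    """Map one OCR <line> element to its text plus PDF-point coordinates.

    Assumption: all attributes are pixel offsets from the image's top-left
    corner, and the image was scanned at `dpi` dots per inch.
    """
    scale = 72.0 / dpi  # PDF user space uses 72 points per inch
    left = int(elem.get("l")) * scale
    right = int(elem.get("r")) * scale
    # The PDF origin is at the bottom-left, so the vertical axis is flipped.
    baseline_y = (page_height_px - int(elem.get("baseline"))) * scale
    height = (int(elem.get("b")) - int(elem.get("t"))) * scale
    return {
        "text": elem.text,
        "x": left,
        "y": baseline_y,
        "width": right - left,
        "height": height,
    }


box = line_to_pdf_box(ET.fromstring(SAMPLE), page_height_px=3300)
print(box)
```

The resulting position and size could then be handed to whatever PDF library draws the text; an invisible layer is normally produced by drawing it with PDF text rendering mode 3 (neither fill nor stroke).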

Update: it may be possible to do what I want in a different way. Suppose there is already a PDF file that was generated from a set of images and that already contains OCRed text. Would it be possible to (maybe programmatically) access just the image of each page, process it (e.g., convert it to monochrome), and save it back into the PDF file? If so, the OCRed text would not be lost.

[Should I put this update into a separate question?]

score 1 · Accepted Answer

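Regarding your follow-up question about processing a PDF file without losing the hidden text layer: I believe Ghostscript can do this. For example, the following command should convert a PDF to grayscale: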

gs -q -dNOPAUSE -dBATCH -sDEVICE=pdfwrite -dColorConversionStrategy=/Gray -dProcessColorModel=/DeviceGray -sOutputFile=output.pdf input.pdf
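If many files need the same treatment, the command above can be driven from a script. The following is only a sketch: it passes exactly the flags shown above, assumes the Ghostscript binary is called `gs` and is on the PATH, and the helper names `build_gs_gray_args` and `convert_to_gray` are made up for illustration.

```python
import shutil
import subprocess


def build_gs_gray_args(src, dst, gs_binary="gs"):
    """Return the argument list for the grayscale conversion shown above."""
    return [
        gs_binary, "-q", "-dNOPAUSE", "-dBATCH",
        "-sDEVICE=pdfwrite",
        "-dColorConversionStrategy=/Gray",
        "-dProcessColorModel=/DeviceGray",
        f"-sOutputFile={dst}",
        src,
    ]


def convert_to_gray(src, dst, gs_binary="gs"):
    """Run Ghostscript; raises CalledProcessError on a non-zero exit code."""
    if shutil.which(gs_binary) is None:
        raise RuntimeError(f"Ghostscript ({gs_binary!r}) was not found on PATH")
    subprocess.run(build_gs_gray_args(src, dst, gs_binary), check=True)


# Building the argument list does not require Ghostscript to be installed.
args = build_gs_gray_args("input.pdf", "output.pdf")
print(" ".join(args))
```

Passing the arguments as a list rather than a single shell string avoids quoting problems with file names that contain spaces.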

score -1 · Accepted Answer

If you only want to convert an existing PDF to grayscale, try ImageMagick:

convert foo.pdf -colorspace Gray -compress zip gray.pdf

I don't think this will change any other properties of your PDF.
