-1

I have a set of PDFs, and I want to process (in VB.NET) only those that are not text-searchable. Can you tell me how to do that?

4

2 Answers

2

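In general, the way to do this is to open each page, rip through its content stream, and see whether it executes any text operators that place text on the page.

Let me explain what that means: PDF content is a small RPN language made up of operators that mark the page in some way. For example, you might see something like this: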

BT 72 400 Td /F0 12 Tf (Throatwarbler Mangrove) Tj ET

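This means:

  1. Begin a text region
  2. Set the position of the text baseline to (72, 400) in PDF units
  3. Set the font to the resource named F0 in the current page's font resource dictionary, at size 12
  4. Draw the text "Throatwarbler Mangrove"
  5. End the text region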
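To make the RPN idea concrete, here is a minimal sketch in Python (chosen only because it runs without a PDF library; the logic ports to VB.NET). It splits a decoded content stream into tokens and groups each operator with the operands that precede it. The names `TOKEN`, `NUMBER`, and `parse_operations` are made up for this illustration, and the tokenizer is deliberately simplified: it assumes the stream is already decompressed and it ignores comments, nested parentheses in strings, and inline images.

```python
import re

# Simplified token pattern for a decoded content stream (an assumption, not a
# full PDF lexer): no comments, no nested parentheses, no inline images.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)"      # (literal string)
    r"|<<|>>"                    # dictionary delimiters
    r"|<[0-9A-Fa-f\s]*>"         # <hex string>
    r"|[\[\]]"                   # array brackets
    r"|/[^\s/\[\]()<>]*"         # /Name
    r"|[^\s/\[\]()<>]+"          # number or operator keyword
)
NUMBER = re.compile(r"[+-]?(?:\d+\.?\d*|\.\d+)")


def parse_operations(content):
    """Return a list of (operator, operands) pairs in stream order."""
    operations, operands = [], []
    for token in TOKEN.findall(content):
        if token[0] in "(/<>[]" or NUMBER.fullmatch(token):
            operands.append(token)            # operands pile up first...
        else:
            operations.append((token, operands))  # ...then the operator consumes them
            operands = []
    return operations


example = "BT 72 400 Td /F0 12 Tf (Throatwarbler Mangrove) Tj ET"
for operator, operands in parse_operations(example):
    print(operator, operands)
```

Each printed pair corresponds to one of the five steps listed above: `BT`, `Td`, `Tf`, `Tj`, `ET`.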

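So you could try some shortcuts:

  1. Does my page's resource dictionary contain any fonts?

This will fail in some cases, because some PDF generation tools put fonts in the resource dictionary and never use them (false positive). It will also fail if the page content includes a Form XObject that contains text (false negative).

  2. Does my page's content stream have BT/ET operators?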

This will get you closer, but it will fail if the BT/ET pair has no content in it (false positive), or if there is no BT/ET pair on the page but there is a Form XObject that contains text (false negative).
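A quick sketch of why the BT/ET shortcut misleads, in Python for brevity. The function `has_text_block` and both sample streams are invented for this illustration; the check is intentionally naive and just looks for a `BT` keyword in an already decoded content stream.

```python
def has_text_block(content):
    """Naive shortcut: does the decoded stream contain a BT keyword at all?"""
    return "BT" in content.split()


# A text object that never shows any text: the shortcut still says "text".
empty_block = "BT ET"

# A page that only invokes Form XObject Fm0 (a hypothetical resource name).
# Any text drawn inside Fm0 is invisible to this check.
form_only = "q /Fm0 Do Q"

print(has_text_block(empty_block))  # True, a false positive
print(has_text_block(form_only))    # False, a false negative if Fm0 draws text
```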

So really, the thing to do is to execute the page's entire content stream, recursing into every Form XObject it invokes, and look for text operators.
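Here is a minimal sketch of that recursive scan, in Python so it runs without a PDF library. Everything in it is an assumption made for illustration: the `Stream` class stands in for whatever your PDF library gives you (decoded content plus the Form XObjects from the resource dictionary), and the tokenizer ignores comments, nested parentheses in strings, and inline images. It treats the text-showing operators `Tj`, `TJ`, `'` and `"` as evidence of text.

```python
import re
from dataclasses import dataclass, field

# Simplified tokenizer (assumption): strings are kept whole so that text such
# as "(Tj)" inside a string is not mistaken for an operator.
TOKEN = re.compile(
    r"\((?:\\.|[^\\()])*\)|<<|>>|<[0-9A-Fa-f\s]*>|[\[\]]"
    r"|/[^\s/\[\]()<>]*|[^\s/\[\]()<>]+"
)
TEXT_SHOWING = {"Tj", "TJ", "'", '"'}


@dataclass
class Stream:
    """Hypothetical stand-in for a page or Form XObject."""
    content: str                               # decoded content stream
    forms: dict = field(default_factory=dict)  # resource name -> Stream (forms only)


def shows_text(stream, seen=None):
    """True if the stream, or any Form XObject it invokes, shows text."""
    seen = set() if seen is None else seen
    if id(stream) in seen:        # guard against circular form references
        return False
    seen.add(id(stream))
    last_name = None
    for token in TOKEN.findall(stream.content):
        if token in TEXT_SHOWING:
            return True
        if token == "Do" and last_name in stream.forms:
            if shows_text(stream.forms[last_name], seen):
                return True
        if token.startswith("/"):
            last_name = token[1:]  # remember the operand for a following Do
    return False


form = Stream("BT /F0 12 Tf (Inside a form) Tj ET")
page_with_form = Stream("q /Fm0 Do Q", {"Fm0": form})
page_empty_block = Stream("BT ET")
page_image_only = Stream("q 612 0 0 792 0 0 cm /Im0 Do Q")

print(shows_text(page_with_form))    # True
print(shows_text(page_empty_block))  # False
print(shows_text(page_image_only))   # False
```

Note that this counts any text-showing operator, including one that draws an empty string, so a real implementation would probably also inspect the operands.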

Now, there's another approach that you can take using Atalasoft's software (disclaimer: I work for Atalasoft and have written most of the PDF handling code; I also worked on Acrobat versions 1-4). Instead of asking "does this page contain any text?", you can ask "does this page contain only a single image?"

bool allPagesImages = true;
using (Document doc = new Document(inputStream))
{
    foreach (Page p in doc.Pages)
    {
        if (!p.SingleImageOnly)
        {
            allPagesImages = false;
            break;
        }
    }
}

This leaves allPagesImages as a pretty decent indication that every page is a single image. If you're looking to OCR the non-searchable documents, that is probably what you really want.

The downside is that this is a very high price to pay for a single predicate, but you also get a PDF rasterizer and the ability to extract the images directly out of the file.

Now, I have no doubt that a solid engineer could work their way through the PDF spec and write some code to extend iTextPdfSharp to do this task. I think that if I sat down with it, I might be able to write that predicate in a few days, but I already know most of the PDF spec, so it might take you more like two weeks to a month. Your choice.

answered 2013-04-23T12:36:57.657
0

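I think this option may be worth considering. I haven't tested the code, but I believe it can be done by reading the properties of each PDF file you want to process.

You can check this link: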

http://www.codeguru.com/columns/vb/manipulating-pdf-files-with-itextsharp-and-vb.net-2012.htm

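After that, you have to read the Producer property. This is just an example. My suggestion is to include your code here so we can try to help you. Best wishes.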

answered 2013-04-23T11:00:54.050