python - How to extract images from a PDF using pdfrw

Question

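I am using one of the pdfrw examples to extract the only image in a PDF file and save it as a PNG or JPEG file.

The code is too hard for me to follow. What arguments should I pass to find_objects?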

from pdfrw.objects import PdfDict, PdfArray, PdfName
from pdfrw.pdfwriter import user_fmt


def find_objects(source, valid_types=(PdfName.XObject, None),
                 valid_subtypes=(PdfName.Form, PdfName.Image),
                 no_follow=(PdfName.Parent,),
                 isinstance=isinstance, id=id, sorted=sorted,
                 reversed=reversed, PdfDict=PdfDict):
    '''
        Find all the objects of a particular kind in a document
        or array.  Defaults to looking for Form and Image XObjects.
        This could be done recursively, but some PDFs
        are quite deeply nested, so we do it without
        recursion.
        Note that we don't know exactly where things appear on pages,
        but we aim for a sort order that is (a) mostly in document order,
        and (b) reproducible.  For arrays, objects are processed in
        array order, and for dicts, they are processed in key order.
    '''
    container = (PdfDict, PdfArray)

    # Allow passing a list of pages, or a dict
    if isinstance(source, PdfDict):
        source = [source]
    else:
        source = list(source)

    visited = set()
    source.reverse()
    while source:
        obj = source.pop()
        if not isinstance(obj, container):
            continue
        myid = id(obj)
        if myid in visited:
            continue
        visited.add(myid)
        if isinstance(obj, PdfDict):
            if obj.Type in valid_types and obj.Subtype in valid_subtypes:
                yield obj
            obj = [y for (x, y) in sorted(obj.iteritems())
                   if x not in no_follow]
        else:
            # TODO: This forces resolution of any indirect objects in
            # the array.  It may not be necessary.  Don't know if
            # reversed() does any voodoo underneath the hood.
            # It's cheap enough for now, but might be removeable.
            obj and obj[0]
        source.extend(reversed(obj))


find_objects('target.pdf')

score 3 · Accepted Answer

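I'm the pdfrw author, and I haven't written code to do this yet :(.

Usually, when I need to do this, I use inkscape. It works quite well in command-line mode.

pdfrw may be useful as part of the reverse path. img2pdf.py is a great tool for placing images on PDF pages, and pdfrw can add those images (once they are in a PDF) to other pages.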

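Edited to add:

pdfrw actually is useful for extracting images, because it can place all the images into a new PDF, one image per page. See extract.py in the examples directory.

It can't (yet???) then extract the images as JPEG, but that is an easy task with inkscape, which even lets you easily crop to the actual image size.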
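
To address the original question about arguments: per its own comment, find_objects accepts a list of pages or a dict, so it needs parsed PDF objects rather than a filename string like 'target.pdf'. Below is a minimal sketch along the lines of the extract.py example, assuming pdfrw is installed and that `pdfrw.findobjs` provides `find_objects` and `page_per_xobj` as it does in the versions I have seen. The helper names `default_output_name`, `list_image_xobjects`, and `extract_xobjects` are made up for illustration and are not part of pdfrw.

```python
import os


def default_output_name(inpfn):
    # Hypothetical helper: "extract." + the input file's base name.
    return 'extract.' + os.path.basename(inpfn)


def list_image_xobjects(inpfn):
    """Return the Image XObjects reachable from the pages of inpfn."""
    # pdfrw is imported lazily so this module still imports without it.
    from pdfrw import PdfReader, PdfName
    from pdfrw.findobjs import find_objects

    # Pass parsed pages (PdfDict objects), not the filename itself.
    pages = PdfReader(inpfn).pages
    return list(find_objects(pages, valid_subtypes=(PdfName.Image,)))


def extract_xobjects(inpfn, outfn=None):
    """Write each Form/Image XObject of inpfn onto its own page of outfn."""
    from pdfrw import PdfReader, PdfWriter
    from pdfrw.findobjs import page_per_xobj

    if outfn is None:
        outfn = default_output_name(inpfn)
    # Assumption: page_per_xobj takes the pages plus a margin in points,
    # as in the examples/extract.py script.
    pages = list(page_per_xobj(PdfReader(inpfn).pages, margin=0.5 * 72))
    if not pages:
        raise IndexError('No XObjects found in %s' % inpfn)
    writer = PdfWriter()
    writer.addpages(pages)
    writer.write(outfn)
    return outfn
```

Calling `extract_xobjects('target.pdf')` should leave you with `extract.target.pdf`, which you can then hand to inkscape to export and crop the image. Note that the result is still a PDF; this sketch does not decode the image stream into a PNG or JPEG.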
