python - Extract images from PDF without resampling, in Python?

Question

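How can I extract all images from a PDF document, at their native resolution and format? (That means extracting TIFF as TIFF, JPEG as JPEG, and so on, without resampling.) Layout is unimportant; I don't care where the source image is located on the page.

I'm using Python 2.7 but can use 3.x if required.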

score 79 · Accepted Answer

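You can use the PyMuPDF module. It writes every image out as a .png file, so the original format is not preserved, but it works out of the box and is fast.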

import fitz
doc = fitz.open("file.pdf")
for i in range(len(doc)):
    for img in doc.getPageImageList(i):
        xref = img[0]
        pix = fitz.Pixmap(doc, xref)
        if pix.n < 5:       # this is GRAY or RGB
            pix.writePNG("p%s-%s.png" % (i, xref))
        else:               # CMYK: convert to RGB first
            pix1 = fitz.Pixmap(fitz.csRGB, pix)
            pix1.writePNG("p%s-%s.png" % (i, xref))
            pix1 = None
        pix = None

See here for more resources.

score 49 · Accepted Answer

In Python, with the PyPDF2 and Pillow libraries, it is simple:

import PyPDF2

from PIL import Image

if __name__ == '__main__':
    input1 = PyPDF2.PdfFileReader(open("input.pdf", "rb"))
    page0 = input1.getPage(0)
    xObject = page0['/Resources']['/XObject'].getObject()

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(obj[1:] + ".png")
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(obj[1:] + ".jpg", "wb")
                img.write(data)
                img.close()
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(obj[1:] + ".jp2", "wb")
                img.write(data)
                img.close()

score 33 · Accepted Answer

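Often in a PDF, the image is simply stored as-is. For example, a PDF with a JPG inserted will have a range of bytes somewhere in the middle that, when extracted, is a valid JPG file. You can use this to extract those byte ranges from the PDF very simply. I wrote about this some time ago, with sample code: Extracting JPGs from PDFs.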

score 25 · Accepted Answer

In Python, with PyPDF2, for images stored with the CCITTFaxDecode filter:

import PyPDF2
import struct

"""
Links:
PDF format: http://www.adobe.com/content/dam/Adobe/en/devnet/acrobat/pdfs/pdf_reference_1-7.pdf
CCITT Group 4: https://www.itu.int/rec/dologin_pub.asp?lang=e&id=T-REC-T.6-198811-I!!PDF-E&type=items
Extract images from pdf: http://stackoverflow.com/questions/2693820/extract-images-from-pdf-without-resampling-in-python
Extract images coded with CCITTFaxDecode in .net: http://stackoverflow.com/questions/2641770/extracting-image-from-pdf-with-ccittfaxdecode-filter
TIFF format and tags: http://www.awaresystems.be/imaging/tiff/faq.html
"""


def tiff_header_for_CCITT(width, height, img_size, CCITT_group=4):
    # The trailing 'l' is the 4-byte offset of the next IFD required by the TIFF layout
    tiff_header_struct = '<' + '2s' + 'h' + 'l' + 'h' + 'hhll' * 8 + 'l'
    return struct.pack(tiff_header_struct,
                       b'II',  # Byte order indication: little endian
                       42,  # Version number (always 42)
                       8,  # Offset to first IFD
                       8,  # Number of tags in IFD
                       256, 4, 1, width,  # ImageWidth, LONG, 1, width
                       257, 4, 1, height,  # ImageLength, LONG, 1, length
                       258, 3, 1, 1,  # BitsPerSample, SHORT, 1, 1
                       259, 3, 1, CCITT_group,  # Compression, SHORT, 1, 3 = CCITT Group 3, 4 = CCITT Group 4
                       262, 3, 1, 0,  # PhotometricInterpretation, SHORT, 1, 0 = WhiteIsZero
                       273, 4, 1, struct.calcsize(tiff_header_struct),  # StripOffsets, LONG, 1, len of header
                       278, 4, 1, height,  # RowsPerStrip, LONG, 1, length
                       279, 4, 1, img_size,  # StripByteCounts, LONG, 1, size of image
                       0  # Offset of next IFD: 0 = no further IFDs
                       )

pdf_filename = 'scan.pdf'
pdf_file = open(pdf_filename, 'rb')
cond_scan_reader = PyPDF2.PdfFileReader(pdf_file)
for i in range(0, cond_scan_reader.getNumPages()):
    page = cond_scan_reader.getPage(i)
    xObject = page['/Resources']['/XObject'].getObject()
    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            """
            The  CCITTFaxDecode filter decodes image data that has been encoded using
            either Group 3 or Group 4 CCITT facsimile (fax) encoding. CCITT encoding is
            designed to achieve efficient compression of monochrome (1 bit per pixel) image
            data at relatively low resolutions, and so is useful only for bitmap image data, not
            for color images, grayscale images, or general data.

            K < 0 --- Pure two-dimensional encoding (Group 4)
            K = 0 --- Pure one-dimensional encoding (Group 3, 1-D)
            K > 0 --- Mixed one- and two-dimensional encoding (Group 3, 2-D)
            """
            if xObject[obj]['/Filter'] == '/CCITTFaxDecode':
                if xObject[obj]['/DecodeParms']['/K'] == -1:
                    CCITT_group = 4
                else:
                    CCITT_group = 3
                width = xObject[obj]['/Width']
                height = xObject[obj]['/Height']
                data = xObject[obj]._data  # sorry, getData() does not work for CCITTFaxDecode
                img_size = len(data)
                tiff_header = tiff_header_for_CCITT(width, height, img_size, CCITT_group)
                img_name = obj[1:] + '.tiff'
                with open(img_name, 'wb') as img_file:
                    img_file.write(tiff_header + data)
                #
                # import io
                # from PIL import Image
                # im = Image.open(io.BytesIO(tiff_header + data))
pdf_file.close()
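
The K rule quoted in the docstring above can be written as a small helper. This is a sketch rather than part of the original answer: the function names are made up, and it assumes a missing /K entry falls back to 0, which is the default the PDF reference gives for CCITTFaxDecode. Group 3 data with K > 0 may additionally need a T4Options tag in the TIFF header, which neither the code above nor this sketch writes.

```python
def tiff_compression_for_k(k=0):
    """Map a CCITTFaxDecode /K value to a TIFF Compression tag value.

    Hypothetical helper: K < 0 means Group 4 (TIFF value 4);
    K >= 0 means Group 3, 1-D or mixed 2-D (TIFF value 3).
    """
    return 4 if k < 0 else 3


def k_from_decode_parms(decode_parms):
    """Read /K from a dict-like /DecodeParms entry, defaulting to 0."""
    if not decode_parms:
        return 0
    return decode_parms.get('/K', 0)


# Example with a plain dict standing in for the PDF dictionary object
group = tiff_compression_for_k(k_from_decode_parms({'/K': -1}))
```

Unlike the `== -1` comparison in the script, this treats every negative K as Group 4, which is what the quoted specification text describes.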

score 18 · Accepted Answer

Libpoppler comes with a tool called "pdfimages" that does exactly this.

(On Ubuntu systems it is in the poppler-utils package.)

http://poppler.freedesktop.org/

http://en.wikipedia.org/wiki/Pdfimages

Windows binaries: http://blog.alivate.com.au/poppler-windows/

score 10 · Accepted Answer

I prefer minecart because it is extremely easy to use. The snippet below shows how to extract images from a PDF:

#pip install minecart
import minecart

pdffile = open('Invoices.pdf', 'rb')
doc = minecart.Document(pdffile)

page = doc.get_page(0) # getting a single page

#iterating through all pages
for page in doc.iter_pages():
    im = page.images[0].as_pil()  # requires pillow
    display(im)

score 7 · Accepted Answer

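Here is my version from 2019, which recursively gets all images from a PDF and reads them with PIL. It is compatible with Python 2/3. I also found that images in a PDF are sometimes compressed with zlib, so my code supports decompression.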

#!/usr/bin/env python3
try:
    from StringIO import StringIO
except ImportError:
    from io import BytesIO as StringIO
from PIL import Image
from PyPDF2 import PdfFileReader, generic
import zlib


def get_color_mode(obj):

    try:
        cspace = obj['/ColorSpace']
    except KeyError:
        return None

    if cspace == '/DeviceRGB':
        return "RGB"
    elif cspace == '/DeviceCMYK':
        return "CMYK"
    elif cspace == '/DeviceGray':
        return "P"

    if isinstance(cspace, generic.ArrayObject) and cspace[0] == '/ICCBased':
        color_map = obj['/ColorSpace'][1].getObject()['/N']
        if color_map == 1:
            return "P"
        elif color_map == 3:
            return "RGB"
        elif color_map == 4:
            return "CMYK"


def get_object_images(x_obj):
    images = []
    for obj_name in x_obj:
        sub_obj = x_obj[obj_name]

        if '/Resources' in sub_obj and '/XObject' in sub_obj['/Resources']:
            images += get_object_images(sub_obj['/Resources']['/XObject'].getObject())

        elif sub_obj['/Subtype'] == '/Image':
            zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
            if zlib_compressed:
               sub_obj._data = zlib.decompress(sub_obj._data)

            images.append((
                get_color_mode(sub_obj),
                (sub_obj['/Width'], sub_obj['/Height']),
                sub_obj._data
            ))

    return images


def get_pdf_images(pdf_fp):
    images = []
    try:
        pdf_in = PdfFileReader(open(pdf_fp, "rb"))
    except:
        return images

    for p_n in range(pdf_in.numPages):

        page = pdf_in.getPage(p_n)

        try:
            page_x_obj = page['/Resources']['/XObject'].getObject()
        except KeyError:
            continue

        images += get_object_images(page_x_obj)

    return images


if __name__ == "__main__":

    pdf_fp = "test.pdf"

    for image in get_pdf_images(pdf_fp):
        (mode, size, data) = image
        try:
            img = Image.open(StringIO(data))
        except Exception as e:
            print ("Failed to read image with PIL: {}".format(e))
            continue
        # Do whatever you want with the image
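
A note on the zlib handling above: after inflating, a /FlateDecode stream is normally raw pixel data rather than a complete image file, so Image.open may reject it. The sketch below is not from the original answer. `inflate_image_data` is a hypothetical helper, and it assumes 8 bits per component and no predictor in /DecodeParms. It inflates the stream and checks that the byte count matches the declared dimensions.

```python
import zlib


def inflate_image_data(raw, width, height, components):
    """Inflate a /FlateDecode stream and sanity-check its length.

    Assumes 8 bits per component and no predictor, so the pixel data
    should be exactly width * height * components bytes long.
    """
    data = zlib.decompress(raw)
    expected = width * height * components
    if len(data) != expected:
        raise ValueError("expected %d bytes, got %d" % (expected, len(data)))
    return data


# Round trip on synthetic data: a 2x2 RGB image is 12 bytes.
pixels = bytes(range(12))
restored = inflate_image_data(zlib.compress(pixels), 2, 2, 3)
```

The returned bytes could then be handed to Image.frombytes(mode, size, data), as other answers on this page do.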

score 6 · Accepted Answer

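I started from @sylvain's code, which had some flaws, such as the NotImplementedError: unsupported filter /DCTDecode exception raised by getData, or the fact that the code failed to find images on some pages because they were nested deeper than the page level.

Here is my code: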

import PyPDF2

from PIL import Image

import sys
from os import path
import warnings
warnings.filterwarnings("ignore")

number = 0

def recurse(page, xObject):
    global number

    xObject = xObject['/Resources']['/XObject'].getObject()

    for obj in xObject:

        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj]._data
            if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                mode = "RGB"
            else:
                mode = "P"

            imagename = "%s - p. %s - %s"%(abspath[:-4], p, obj[1:])

            if xObject[obj]['/Filter'] == '/FlateDecode':
                img = Image.frombytes(mode, size, data)
                img.save(imagename + ".png")
                number += 1
            elif xObject[obj]['/Filter'] == '/DCTDecode':
                img = open(imagename + ".jpg", "wb")
                img.write(data)
                img.close()
                number += 1
            elif xObject[obj]['/Filter'] == '/JPXDecode':
                img = open(imagename + ".jp2", "wb")
                img.write(data)
                img.close()
                number += 1
        else:
            recurse(page, xObject[obj])



try:
    _, filename, *pages = sys.argv
    *pages, = map(int, pages)
    abspath = path.abspath(filename)
except BaseException:
    print('Usage :\nPDF_extract_images file.pdf page1 page2 page3 …')
    sys.exit()


file = PyPDF2.PdfFileReader(open(filename, "rb"))

for p in pages:    
    page0 = file.getPage(p-1)
    recurse(p, page0)

print('%s extracted images'% number)

score 5 · Accepted Answer

PikePDF can do this with very little code:

from pikepdf import Pdf, PdfImage

filename = "sample-in.pdf"
example = Pdf.open(filename)

for i, page in enumerate(example.pages):
    for j, (name, raw_image) in enumerate(page.images.items()):
        image = PdfImage(raw_image)
        out = image.extract_to(fileprefix=f"{filename}-page{i:03}-img{j:03}")

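extract_to will automatically pick the file extension based on how the image is encoded in the PDF.

If you want, you can also print some details about the images as you extract them: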

        # Optional: print info about image
        w = raw_image.stream_dict.Width
        h = raw_image.stream_dict.Height
        f = raw_image.stream_dict.Filter
        size = raw_image.stream_dict.Length

        print(f"Wrote {name} {w}x{h} {f} {size:,}B {image.colorspace} to {out}")

which can print something like

Wrote /Im1 150x150 /DCTDecode 5,952B /ICCBased to sample2.pdf-page000-img000.jpg
Wrote /Im10 32x32 /FlateDecode 36B /ICCBased to sample2.pdf-page000-img001.png
...

See the docs for more that you can do with images, including replacing them in the PDF file.

score 5 · Accepted Answer

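A much easier solution:

Use the poppler-utils package. To install it, use Homebrew (Homebrew is macOS-specific, but you can find the poppler-utils package for Windows or Linux here: https://poppler.freedesktop.org/). The first line of code below installs poppler-utils using Homebrew. After installation, the second line (run from the command line) extracts the images from a PDF file and names them "image*". To run this from within Python, use the os or subprocess module. The third line is code using the os module; beneath that is an example with subprocess (the run() function needs Python 3.5 or later). More information: https://www.cyberciti.biz/faq/easy-extract-images-from-pdf-file/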

brew install poppler

pdfimages file.pdf image

import os
os.system('pdfimages file.pdf image')

or

import subprocess
subprocess.run('pdfimages file.pdf image', shell=True)
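
The shell=True call above breaks if the file name contains spaces or shell metacharacters. A possible variant, offered only as a sketch, passes the arguments as a list instead. The helper names and the example file name are made up; the -all flag is the one used in another answer on this page to keep images in their native format.

```python
import subprocess


def pdfimages_command(pdf_path, prefix, native_formats=True):
    """Build the argument list for pdfimages (from poppler-utils)."""
    cmd = ["pdfimages"]
    if native_formats:
        cmd.append("-all")  # keep JPEG, JPEG 2000, JBIG2 and CCITT data as stored
    cmd += [pdf_path, prefix]
    return cmd


def extract_images(pdf_path, prefix):
    """Run pdfimages; raises CalledProcessError on a non-zero exit code."""
    subprocess.run(pdfimages_command(pdf_path, prefix), check=True)


# The path stays one argument even though it contains a space.
command = pdfimages_command("my file.pdf", "image")
```

Because no shell is involved, the path never needs quoting.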

score 4 · Accepted Answer

I installed ImageMagick on my server and then run command-line calls through Popen:

 #!/usr/bin/python

 import sys
 import os
 import subprocess
 import settings

 IMAGE_PATH = os.path.join(settings.MEDIA_ROOT , 'pdf_input' )

 def extract_images(pdf):
     output = 'temp.png'
     cmd = 'convert ' + os.path.join(IMAGE_PATH, pdf) + ' ' + os.path.join(IMAGE_PATH, output)
     subprocess.Popen(cmd.split(), stderr=subprocess.STDOUT, stdout=subprocess.PIPE)

This will create an image for every page and store them as temp-0.png, temp-1.png .... It only counts as "extraction" if you have a PDF that contains only images and no text.

score 4 · Accepted Answer

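Well, I have been struggling with this for many weeks. Many of these answers helped me along, but there was always something missing; apparently no one here has ever had problems with jbig2-encoded images.

In the bunch of PDFs that I have to scan, images encoded in jbig2 are very common.

As far as I understand, there are many copy/scan machines that scan paper and turn it into PDF files full of jbig2-encoded images.

So, after many days of testing, I decided to go with the answer proposed here long ago by dkagedal.

Here is my step-by-step on Linux (if you have another OS, I suggest using a Linux Docker container; it will be much easier):

First step: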

apt-get install poppler-utils

Then I was able to run the command-line tool called pdfimages like this:

pdfimages -all myfile.pdf ./images_found/

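With the above command you will be able to extract all the images contained in myfile.pdf and save them inside images_found (you have to create images_found first).

In the list you will find several types of images: png, jpg, tiff. All of these are easily readable with any graphics tool.

Then you will have some files named like -145.jb2e and -145.jb2g.

These 2 files contain ONE image encoded in jbig2, saved in 2 different files: one for the header and one for the data.

Again, I lost many days trying to find out how to convert those files into something readable, and finally I came across this tool called jbig2dec.

So first you need to install this magic tool: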

apt-get install jbig2dec

Then you can run:

jbig2dec -t png -145.jb2g -145.jb2e

You will finally be able to convert all the extracted images into something useful.
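
The conversion step could be batched from Python. The following is only a sketch under assumptions: the helper names are invented, it assumes pdfimages wrote each JBIG2 image as a pair of files sharing a stem (.jb2g and .jb2e), and it uses only the jbig2dec options shown in the command above.

```python
import os


def pair_jbig2_files(filenames):
    """Return (jb2g_file, jb2e_file) pairs that share the same stem."""
    names = set(filenames)
    pairs = []
    for name in sorted(names):
        stem, ext = os.path.splitext(name)
        if ext == ".jb2e" and stem + ".jb2g" in names:
            pairs.append((stem + ".jb2g", name))
    return pairs


def jbig2dec_command(jb2g_file, jb2e_file):
    """Build the jbig2dec call from the answer as an argument list."""
    return ["jbig2dec", "-t", "png", jb2g_file, jb2e_file]


found = pair_jbig2_files(["-145.jb2e", "-145.jb2g", "-146.jb2e", "photo.jpg"])
```

Each pair could then be passed to subprocess.run(jbig2dec_command(g, e)). Files that have no matching partner, such as -146.jb2e in the example, are skipped rather than guessed at.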

Good luck!

score 4 · Accepted Answer

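After some searching I found the following script, which works really well with my PDFs. It only handles JPG, but it worked perfectly with my unprotected files. It also does not require any external libraries.

To give credit where it is due, the script comes from Ned Batchelder, not from me. Python 3 code: extract JPGs from PDFs. Quick and dirty.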

import sys

with open(sys.argv[1],"rb") as file:
    file.seek(0)
    pdf = file.read()

startmark = b"\xff\xd8"
startfix = 0
endmark = b"\xff\xd9"
endfix = 2
i = 0

njpg = 0
while True:
    istream = pdf.find(b"stream", i)
    if istream < 0:
        break
    istart = pdf.find(startmark, istream, istream + 20)
    if istart < 0:
        i = istream + 20
        continue
    iend = pdf.find(b"endstream", istart)
    if iend < 0:
        raise Exception("Didn't find end of stream!")
    iend = pdf.find(endmark, iend - 20)
    if iend < 0:
        raise Exception("Didn't find end of JPG!")

    istart += startfix
    iend += endfix
    print("JPG %d from %d to %d" % (njpg, istart, iend))
    jpg = pdf[istart:iend]
    with open("jpg%d.jpg" % njpg, "wb") as jpgfile:
        jpgfile.write(jpg)

    njpg += 1
    i = iend

score 4 · Accepted Answer

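I did this for my own program and found that the best library to use was PyMuPDF. It lets you find out the "xref" number of each image on each page and use it to extract the raw image data from the PDF.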

import fitz
from PIL import Image
import io

filePath = "path/to/file.pdf"
#opens doc using PyMuPDF
doc = fitz.Document(filePath)

#loads the first page
page = doc.loadPage(0)

#[First image on page described thru a list][First attribute on image list: xref n], check PyMuPDF docs under getImageList()
xref = page.getImageList()[0][0]

#gets the image as a dict, check docs under extractImage 
baseImage = doc.extractImage(xref)

#gets the raw string image data from the dictionary and wraps it in a BytesIO object before using PIL to open it
image = Image.open(io.BytesIO(baseImage['image']))

#Displays image for good measure
image.show()

Make sure to check out the docs, though.

score 3 · Accepted Answer

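After reading the posts that use pyPDF2:

The error raised by @sylvain's code, NotImplementedError: unsupported filter /DCTDecode, must come from the .getData() method. It goes away when ._data is used instead, as suggested by @Alex Paramonov.

So far I have only met "DCTDecode" cases, but I am sharing the adapted code, which includes remarks from the different posts: zlib from @Alex Paramonov, and sub_obj['/Filter'] being a list from @mxl.

Hope it can help pyPDF2 users. The code follows: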

    import sys
    import PyPDF2, traceback
    import zlib
    try:
        from PIL import Image
    except ImportError:
        import Image

    pdf_path = 'path_to_your_pdf_file.pdf'
    input1 = PyPDF2.PdfFileReader(open(pdf_path, "rb"))
    nPages = input1.getNumPages()

    for i in range(nPages) :
        page0 = input1.getPage(i)

        if '/XObject' in page0['/Resources']:
            try:
                xObject = page0['/Resources']['/XObject'].getObject()
            except :
                xObject = []

            for obj_name in xObject:
                sub_obj = xObject[obj_name]
                if sub_obj['/Subtype'] == '/Image':
                    zlib_compressed = '/FlateDecode' in sub_obj.get('/Filter', '')
                    if zlib_compressed:
                       sub_obj._data = zlib.decompress(sub_obj._data)

                    size = (sub_obj['/Width'], sub_obj['/Height'])
                    data = sub_obj._data#sub_obj.getData()
                    try :
                        if sub_obj['/ColorSpace'] == '/DeviceRGB':
                            mode = "RGB"
                        elif sub_obj['/ColorSpace'] == '/DeviceCMYK':
                            mode = "CMYK"
                            # will cause errors when saving (might need convert to RGB first)
                        else:
                            mode = "P"

                        fn = 'p%03d-%s' % (i + 1, obj_name[1:])
                        if '/Filter' in sub_obj:
                            if '/FlateDecode' in sub_obj['/Filter']:
                                img = Image.frombytes(mode, size, data)
                                img.save(fn + ".png")
                            elif '/DCTDecode' in sub_obj['/Filter']:
                                img = open(fn + ".jpg", "wb")
                                img.write(data)
                                img.close()
                            elif '/JPXDecode' in sub_obj['/Filter']:
                                img = open(fn + ".jp2", "wb")
                                img.write(data)
                                img.close()
                            elif '/CCITTFaxDecode' in sub_obj['/Filter']:
                                img = open(fn + ".tiff", "wb")
                                img.write(data)
                                img.close()
                            elif '/LZWDecode' in sub_obj['/Filter'] :
                                img = open(fn + ".tif", "wb")
                                img.write(data)
                                img.close()
                            else :
                                print('Unknown format:', sub_obj['/Filter'])
                        else:
                            img = Image.frombytes(mode, size, data)
                            img.save(fn + ".png")
                    except:
                        traceback.print_exc()
        else:
            print("No image found for page %d" % (i + 1))

score 2 · Accepted Answer

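As of February 2019, the solution given by @sylvain (at least on my setup) does not work without a small modification: xObject[obj]['/Filter'] is not a single value but a list, so to make the script work I had to change the format checks as follows: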

import PyPDF2, traceback

from PIL import Image

input1 = PyPDF2.PdfFileReader(open(src, "rb"))
nPages = input1.getNumPages()
print nPages

for i in range(nPages) :
    print i
    page0 = input1.getPage(i)
    try :
        xObject = page0['/Resources']['/XObject'].getObject()
    except : xObject = []

    for obj in xObject:
        if xObject[obj]['/Subtype'] == '/Image':
            size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
            data = xObject[obj].getData()
            try :
                if xObject[obj]['/ColorSpace'] == '/DeviceRGB':
                    mode = "RGB"
                elif xObject[obj]['/ColorSpace'] == '/DeviceCMYK':
                    mode = "CMYK"
                    # will cause errors when saving
                else:
                    mode = "P"

                fn = 'p%03d-%s' % (i + 1, obj[1:])
                print '\t', fn
                if '/FlateDecode' in xObject[obj]['/Filter'] :
                    img = Image.frombytes(mode, size, data)
                    img.save(fn + ".png")
                elif '/DCTDecode' in xObject[obj]['/Filter']:
                    img = open(fn + ".jpg", "wb")
                    img.write(data)
                    img.close()
                elif '/JPXDecode' in xObject[obj]['/Filter'] :
                    img = open(fn + ".jp2", "wb")
                    img.write(data)
                    img.close()
                elif '/LZWDecode' in xObject[obj]['/Filter'] :
                    img = open(fn + ".tif", "wb")
                    img.write(data)
                    img.close()
                else :
                    print 'Unknown format:', xObject[obj]['/Filter']
            except :
                traceback.print_exc()
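
Since /Filter can be either a single name or a list of names, one option is to normalize it before checking. This is a sketch, not part of the original answer: `filter_names`, `extension_for` and `EXTENSIONS` are hypothetical names, and plain str and list values stand in for PyPDF2's name and array objects.

```python
EXTENSIONS = {
    "/DCTDecode": ".jpg",
    "/JPXDecode": ".jp2",
    "/FlateDecode": ".png",
}


def filter_names(filter_value):
    """Return the /Filter entry as a list, whether it was a name or an array."""
    if filter_value is None:
        return []
    if isinstance(filter_value, str):
        return [filter_value]
    return list(filter_value)


def extension_for(filter_value):
    """Pick an output extension from the first recognized filter, else None."""
    for name in filter_names(filter_value):
        if name in EXTENSIONS:
            return EXTENSIONS[name]
    return None


single = extension_for("/DCTDecode")
listed = extension_for(["/FlateDecode"])
```

Returning None for an unknown filter leaves it to the caller to report it, much like the 'Unknown format' branch in the script.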

score 1 · Accepted Answer

You can also use the pdfimages command in Ubuntu.

Install the poppler library using the commands below.

sudo apt install poppler-utils

sudo apt-get install python-poppler

pdfimages file.pdf image

The list of files created is (for example, when there are two images in the PDF):

image-000.png
image-001.png

It works! Now you can use subprocess.run to run this from Python.

score 1 · Accepted Answer

I added all of those together in PyPDFTK here.

My own contribution is the handling of /Indexed files, like this:

for obj in xObject:
    if xObject[obj]['/Subtype'] == '/Image':
        size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
        color_space = xObject[obj]['/ColorSpace']
        if isinstance(color_space, pdf.generic.ArrayObject) and color_space[0] == '/Indexed':
            color_space, base, hival, lookup = [v.getObject() for v in color_space] # pg 262
        mode = img_modes[color_space]

        if xObject[obj]['/Filter'] == '/FlateDecode':
            data = xObject[obj].getData()
            img = Image.frombytes(mode, size, data)
            if color_space == '/Indexed':
                img.putpalette(lookup.getData())
                img = img.convert('RGB')
            img.save("{}{:04}.png".format(filename_prefix, i))

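Note that when /Indexed files are found, you can't just compare /ColorSpace to a string, because it comes as an ArrayObject. So we have to check the array, retrieve the indexed palette (lookup in the code) and set it in the PIL Image object. Otherwise the palette stays uninitialized (all zeros) and the whole image shows up black.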
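
To make the palette step concrete, here is roughly what putpalette followed by convert('RGB') amounts to, written without PIL. It is an illustrative sketch only: `expand_indexed` is a made-up name, and it assumes 8 bits per index and an RGB base color space, so the lookup table is a flat run of 3-byte entries.

```python
def expand_indexed(indices, lookup):
    """Replace each 1-byte palette index with its 3-byte RGB entry."""
    out = bytearray()
    for index in indices:  # iterating over bytes yields ints on Python 3
        out += lookup[3 * index:3 * index + 3]
    return bytes(out)


palette = bytes([255, 0, 0, 0, 255, 0])  # entry 0 = red, entry 1 = green
pixels = bytes([0, 1, 1, 0])             # a 2x2 image of palette indices
rgb = expand_indexed(pixels, palette)
```

With a palette left at zeros, every index would map to (0, 0, 0), which matches the all-black result described above.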

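My first instinct was to save them as GIFs (which is an indexed format), but my tests showed that the PNGs were smaller and looked the same.

I found these types of images when printing to PDF with Foxit Reader PDF Printer.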

score 0 · Accepted Answer

I rewrote the solutions as a single Python class. It should be easy to work with. If you come across a new "/Filter" or "/ColorSpace", just add it to the internal dictionaries.

https://github.com/survtur/extract_images_from_pdf

Requirements:

Python3.6+
PyPDF2
PIL

score 0 · Accepted Answer

Try the code below. It will extract all images from the PDF.

    import sys
    import PyPDF2
    from PIL import Image
    pdf=sys.argv[1]
    print(pdf)
    input1 = PyPDF2.PdfFileReader(open(pdf, "rb"))
    for x in range(0,input1.numPages):
        xObject=input1.getPage(x)
        xObject = xObject['/Resources']['/XObject'].getObject()
        for obj in xObject:
            if xObject[obj]['/Subtype'] == '/Image':
                size = (xObject[obj]['/Width'], xObject[obj]['/Height'])
                print(size)
                data = xObject[obj]._data
                #print(data)
                print(xObject[obj]['/Filter'])
                if xObject[obj]['/Filter'][0] == '/DCTDecode':
                    img_name=str(x)+".jpg"
                    print(img_name)
                    img = open(img_name, "wb")
                    img.write(data)
                    img.close()
        print(str(x)+" is done")

score 0 · Accepted Answer

First install pdf2image

pip install pdf2image==1.14.0

Use the code below to extract the pages from the PDF.

from pdf2image import convert_from_path, pdfinfo_from_path

file_path = "file path of PDF"
image_path = "output directory"  # placeholder: folder where the PNGs are written
info = pdfinfo_from_path(file_path, userpw=None, poppler_path=None)
maxPages = info["Pages"]
image_counter = 0
if maxPages > 10:
    # convert 10 pages at a time; + 1 so the final page is not skipped
    for first in range(1, maxPages + 1, 10):
        pages = convert_from_path(file_path, dpi=300, first_page=first,
                last_page=min(first+10-1, maxPages))
        for page in pages:
            page.save(image_path+'/' + str(image_counter) + '.png', 'PNG')
            image_counter += 1
else:
    pages = convert_from_path(file_path, 300)
    for i, j in enumerate(pages):
        j.save(image_path+'/' + str(i) + '.png', 'PNG')

Hope it helps coders who want to easily convert PDF files to images, page by page.
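
The batching logic in the code above can be isolated into a small generator, which makes the page ranges easy to check. This is a sketch with a made-up name (`page_batches`); it only computes 1-based (first_page, last_page) pairs and does not call pdf2image itself.

```python
def page_batches(page_count, batch_size=10):
    """Yield 1-based (first_page, last_page) pairs covering every page."""
    for first in range(1, page_count + 1, batch_size):
        yield first, min(first + batch_size - 1, page_count)


batches = list(page_batches(25))
print(batches)  # [(1, 10), (11, 20), (21, 25)]
```

Each pair could be passed as first_page and last_page to convert_from_path, so only one batch of rendered pages is held in memory at a time.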
