python - How do I write extracted images to a file object instead of the file system?

Question

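I'm using the Python pdfminer library to extract text and images from a PDF. Because the TextConverter class writes to sys.stdout by default, I used StringIO to capture the text as a variable, as follows (see paste):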

from StringIO import StringIO  # Python 2 module

from pdfminer.converter import TextConverter
from pdfminer.image import ImageWriter
from pdfminer.layout import LAParams
from pdfminer.pdfinterp import PDFPageInterpreter, PDFResourceManager
from pdfminer.pdfpage import PDFPage

def extractTextAndImagesFromPDF(rawFile):
    laparams = LAParams()
    imagewriter = ImageWriter('extractedImageFolder/')
    resourceManager = PDFResourceManager(caching=True)

    outfp = StringIO()  # Use StringIO to catch the output later.
    device = TextConverter(resourceManager, outfp, codec='utf-8', laparams=laparams, imagewriter=imagewriter)
    interpreter = PDFPageInterpreter(resourceManager, device)
    for page in PDFPage.get_pages(rawFile, set(), maxpages=0, caching=True, check_extractable=True):
        interpreter.process_page(page)
    device.close()
    extractedText = outfp.getvalue()  # Get the text from the StringIO
    outfp.close()
    return extractedText

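This works for the extracted text. The function also extracts the images in the PDF and writes them to 'extractedImageFolder/'. That works fine too, but I now want the images to be "written" to file objects instead of the file system, so that I can do some post-processing on them.

The ImageWriter class opens a file (fp = file(path, 'wb')) and then writes to it. What I want is for my extractTextAndImagesFromPDF() function to also return a list of file objects instead of writing them straight to files. I suppose I need StringIO for this as well, but I don't know how, partly because the file writing happens inside pdfminer.

Does anyone know how I can return a list of file objects instead of writing the images to the file system? All tips are welcome!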

score 1 · Accepted Answer

Here is a hack that lets you supply your own file pointer to write to:

# add an optional argument so callers can supply their own file pointer
def export_image(self, image, fp=None):
    ...
    # change this line:
    # fp = file(path, 'wb')
    # to (remember whether the caller passed a file pointer):
    user_fp = fp is not None
    if not user_fp:
        fp = file(path, 'wb')
    ...
    # and this line:
    # return name
    # to (existing callers still get just the name):
    return (fp, name) if user_fp else name

Now you need to use:

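import StringIO  # Python 2 module

# create a file-like object backed by a string buffer
fp = StringIO.StringIO()
# export_image is a method, so call it on the ImageWriter instance
image_fp, name = imagewriter.export_image(image, fp)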

and your image should be stored in fp.
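Reading the image back out of the buffer takes one extra step, because the write leaves the position at the end. The following is a minimal sketch, assuming Python 3, where binary data goes into io.BytesIO (the role the StringIO buffer plays in the answer). write_image and fake_image are hypothetical stand-ins for the patched export_image and a real extracted image.

```python
import io


def write_image(data, fp):
    """Stand-in for an image exporter: writes raw bytes to any file-like object."""
    fp.write(data)


fake_image = b"\xff\xd8\xff\xe0 not a real JPEG"  # hypothetical payload
fp = io.BytesIO()
write_image(fake_image, fp)

# After writing, the position is at the end of the buffer, so read() is empty.
leftover = fp.read()

# getvalue() returns the whole contents regardless of the current position.
contents = fp.getvalue()

# Alternatively, rewind so downstream code can read() from the start.
fp.seek(0)
reread = fp.read()

# Closing an in-memory buffer discards its contents.
fp.close()
```

If the patched export_image still closes fp before returning, the buffer would have to be read before that point, or the close skipped for caller-supplied buffers.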

Note that the behavior of export_image stays the same wherever else it is called without the extra argument.
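A possible alternative that avoids editing the library is to pass the converter an object that merely looks like an ImageWriter and keeps every image in memory. This is a rough sketch, not pdfminer's own API: InMemoryImageWriter, FakeStream, and FakeImage are hypothetical names, and it assumes the converter only ever calls imagewriter.export_image(image) with an object that has a name attribute and a stream with get_rawdata(). That assumption should be checked against the installed pdfminer version. The raw bytes may also still be encoded or lack a file header, depending on the image's filter.

```python
import io


class InMemoryImageWriter:
    """Hypothetical stand-in for ImageWriter that keeps images in memory."""

    def __init__(self):
        self.images = []  # list of (name, BytesIO) pairs

    def export_image(self, image):
        # Assumption: `image` exposes `name` and `stream.get_rawdata()`.
        buf = io.BytesIO(image.stream.get_rawdata())
        self.images.append((image.name, buf))
        return image.name


# Stand-ins so the sketch runs without pdfminer installed.
class FakeStream:
    def __init__(self, data):
        self._data = data

    def get_rawdata(self):
        return self._data


class FakeImage:
    def __init__(self, name, data):
        self.name = name
        self.stream = FakeStream(data)


writer = InMemoryImageWriter()
writer.export_image(FakeImage("Im0", b"first image bytes"))
writer.export_image(FakeImage("Im1", b"second image bytes"))

# The list of file objects the question asks for.
file_objects = [buf for _, buf in writer.images]
```

In the question's function, such a writer would take the place of ImageWriter('extractedImageFolder/'), and the function could return writer.images alongside the extracted text.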
