java - How do I extract images and their metadata from a PDF?

Question

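Is it possible to use Java to extract images from a PDF file and export them to a specific folder without losing their original creation and modification dates? I tried to achieve this with iText and PDFBox, but without success. Any ideas or examples are welcome.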

score 6 · Accepted Answer

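The images do not contain the metadata; they are stored as raw data that needs to be assembled into an image. I wrote two blog posts explaining how image data is stored inside a PDF file: https://blog.idrsolutions.com/2010/04/understanding-the-pdf-file-format-how-are-images-stored/ and https://blog.idrsolutions.com/2010/09/understanding-the-pdf-file-format-images/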
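To illustrate what "assembling raw data into an image" can mean in the simplest case, here is a minimal sketch. It assumes the stream has already been decoded to interleaved 8-bit R, G, B samples and that the width and height are known from the image dictionary; the class name `RawImageAssembler` and the method `fromRgb` are hypothetical, and real PDFs can use other color spaces and bit depths that this does not cover.

```java
import java.awt.image.BufferedImage;

class RawImageAssembler {
    /**
     * Builds an RGB image from interleaved 8-bit R,G,B samples.
     * Assumption: samples are already decoded and stored row by row.
     */
    static BufferedImage fromRgb(byte[] samples, int width, int height) {
        if (samples.length != width * height * 3) {
            throw new IllegalArgumentException(
                "expected " + (width * height * 3) + " bytes, got " + samples.length);
        }
        BufferedImage image = new BufferedImage(width, height, BufferedImage.TYPE_INT_RGB);
        int i = 0;
        for (int y = 0; y < height; y++) {
            for (int x = 0; x < width; x++) {
                // Mask with 0xFF because Java bytes are signed
                int r = samples[i++] & 0xFF;
                int g = samples[i++] & 0xFF;
                int b = samples[i++] & 0xFF;
                image.setRGB(x, y, (r << 16) | (g << 8) | b);
            }
        }
        return image;
    }
}
```

The resulting `BufferedImage` could then be written out with `javax.imageio.ImageIO.write`. Note that nothing in this path carries a date: the file you write gets whatever timestamps the file system assigns at export time.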

score 4 · Accepted Answer

I disagree with the others and offer a proof of concept (POC) for your question: you can extract the XMP metadata of the images with PDFBox as follows:

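// Imports assume PDFBox 1.x, whose API this code uses
import java.io.BufferedReader;
import java.io.File;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import org.apache.pdfbox.pdmodel.PDDocument;
import org.apache.pdfbox.pdmodel.PDPage;
import org.apache.pdfbox.pdmodel.PDResources;
import org.apache.pdfbox.pdmodel.common.PDMetadata;
import org.apache.pdfbox.pdmodel.graphics.xobject.PDXObjectImage;

public void getXMPInformation() {
    // Open PDF document; give up if it cannot be loaded
    PDDocument document;
    try {
        document = PDDocument.load(PATH_TO_YOUR_DOCUMENT);
    } catch (IOException e) {
        e.printStackTrace();
        return;
    }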
    // Get all pages and loop through them
    List pages = document.getDocumentCatalog().getAllPages();
    Iterator iter = pages.iterator();
    while( iter.hasNext() ) {
        PDPage page = (PDPage)iter.next();
        PDResources resources = page.getResources();            
        Map images = null;
        // Get all Images on page
        try {
            images = resources.getImages();
        } catch (IOException e) {
            e.printStackTrace();
        }
        if( images != null ) {
            // Check all images for metadata
            Iterator imageIter = images.keySet().iterator();
            while( imageIter.hasNext() ) {
                String key = (String)imageIter.next();
                PDXObjectImage image = (PDXObjectImage)images.get( key );
                PDMetadata metadata = image.getMetadata();
                System.out.println("Found an image: Analyzing for Metadata");
                if (metadata == null) {
                    System.out.println("No Metadata found for this image.");
                } else {
                    InputStream xmlInputStream = null;
                    try {
                        xmlInputStream = metadata.createInputStream();
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                    try {
                        System.out.println("--------------------------------------------------------------------------------");
                        String mystring = convertStreamToString(xmlInputStream);
                        System.out.println(mystring);
                    } catch (IOException e) {
                        e.printStackTrace();
                    }
                }
                // Export the image; write2file() appends the suffix itself
                String name = getUniqueFileName( key, image.getSuffix() );
                System.out.println( "Writing image:" + name );
                try {
                    image.write2file( name );
                } catch (IOException e) {
                    e.printStackTrace();
                }
                System.out.println("--------------------------------------------------------------------------------");
            }
        }
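    }
    // Release the resources held by the document
    try {
        document.close();
    } catch (IOException e) {
        e.printStackTrace();
    }
}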

And the helper methods:

public String convertStreamToString(InputStream is) throws IOException {
    /*
     * To convert the InputStream to a String we use BufferedReader.readLine().
     * We iterate until the BufferedReader returns null, which means there is
     * no more data to read. Each line is appended to a StringBuilder and the
     * result is returned as a String.
     */
    if (is != null) {
        StringBuilder sb = new StringBuilder();
        String line;

        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(is, "UTF-8"));
            while ((line = reader.readLine()) != null) {
                sb.append(line).append("\n");
            }
        } finally {
            is.close();
        }
        return sb.toString();
    } else {       
        return "";
    }
}

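// Counter shared by all calls; counts the file names tried so far
private int imageCounter = 0;

private String getUniqueFileName( String prefix, String suffix ) {
    /*
     * Try prefix-0, prefix-1, ... until the name is not taken by an
     * existing file. The counter must advance inside the loop; otherwise
     * an existing file would make the loop spin forever.
     */
    String uniqueName = null;
    File f = null;
    while( f == null || f.exists() ) {
        uniqueName = prefix + "-" + imageCounter;
        f = new File( uniqueName + "." + suffix );
        imageCounter++;
    }
    // The suffix is left off because write2file() appends it
    return uniqueName;
}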

Note: this is a quick-and-dirty proof of concept, not well-styled code.

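The images must already have XMP metadata when they are placed in InDesign, before the PDF document is built. The XMP metadata can be set with Photoshop, for example. Be aware that not all IPTC/Exif/... information is converted to XMP metadata; only a small number of fields are.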

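I use this method on JPG and PNG images placed in PDFs that were built with InDesign. It works well, and I can get all the image information from the finished PDFs (picture coating) after the production steps.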
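If the XMP packet printed by the code above contains date properties, they can be pulled out with the JDK's own XML parser. This is a minimal sketch, assuming the packet uses the standard `xmp` namespace (`http://ns.adobe.com/xap/1.0/`) and stores `xmp:CreateDate` or `xmp:ModifyDate` either as child elements or as attributes of `rdf:Description`. The class name `XmpDates` and the method `find` are hypothetical, and whether a given image carries these properties at all depends on the tool that produced the PDF.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.Optional;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

class XmpDates {
    static final String XMP_NS = "http://ns.adobe.com/xap/1.0/";
    static final String RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#";

    /** Returns the raw value of xmp:<property>, e.g. "CreateDate", if present. */
    static Optional<String> find(String xmpPacket, String property) {
        Document doc;
        try {
            DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
            factory.setNamespaceAware(true);
            doc = factory.newDocumentBuilder()
                         .parse(new InputSource(new StringReader(xmpPacket)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("not a parsable XMP packet", e);
        }
        // Form 1: <xmp:CreateDate>...</xmp:CreateDate>
        NodeList elements = doc.getElementsByTagNameNS(XMP_NS, property);
        if (elements.getLength() > 0) {
            return Optional.of(elements.item(0).getTextContent().trim());
        }
        // Form 2: <rdf:Description xmp:CreateDate="..."/>
        NodeList descriptions = doc.getElementsByTagNameNS(RDF_NS, "Description");
        for (int i = 0; i < descriptions.getLength(); i++) {
            Element description = (Element) descriptions.item(i);
            if (description.hasAttributeNS(XMP_NS, property)) {
                return Optional.of(description.getAttributeNS(XMP_NS, property));
            }
        }
        return Optional.empty();
    }
}
```

With the string returned by `convertStreamToString`, a call such as `XmpDates.find(mystring, "ModifyDate")` would yield the date text when it is there. The value is left as a string on purpose, because XMP dates may omit the time or the time zone, so parsing them is a separate decision.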

score 3 · Accepted Answer

Short answer

Maybe, but probably not.

Long answer

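PDF natively supports JPEG, JPEG2000 (increasingly common), CCITT (fax) Group 3 & 4, and JBIG2 (very rare). Images in these formats can be copied byte for byte into the PDF, preserving any metadata inside the file. The creation/modification dates are generally part of the file system, not of the image.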

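JPEG: It doesn't look like it supports internal metadata.

JPEG2000: Yes. There can be a lot of stuff in there.

CCITT: It doesn't look like it.

JBIG2: Err... I think so, but it isn't clear from the specs I just skimmed.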
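Because creation and modification dates live in the file system, an exported image gets fresh timestamps unless you set them yourself. If you do recover a date from somewhere (for example from XMP), the following sketch shows one way to stamp it onto the exported file using `java.nio.file`. It assumes the dates are full ISO 8601 timestamps with a UTC offset; the class name `FileDateStamper` and the method `stamp` are hypothetical. On some platforms (Linux in particular) the creation time cannot be changed and the request for it is ignored.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.attribute.BasicFileAttributeView;
import java.nio.file.attribute.FileTime;
import java.time.OffsetDateTime;

class FileDateStamper {
    /**
     * Sets the creation and last-modified times of an existing file.
     * Assumption: both arguments look like "2011-05-17T14:03:22+02:00".
     */
    static void stamp(Path file, String created, String modified) {
        FileTime createTime = FileTime.from(OffsetDateTime.parse(created).toInstant());
        FileTime modifyTime = FileTime.from(OffsetDateTime.parse(modified).toInstant());
        BasicFileAttributeView view =
            Files.getFileAttributeView(file, BasicFileAttributeView.class);
        try {
            // Argument order: lastModifiedTime, lastAccessTime, createTime;
            // null leaves the access time unchanged
            view.setTimes(modifyTime, null, createTime);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

Usage would look like `FileDateStamper.stamp(Paths.get("Im0-0.jpg"), createDate, modifyDate)`, where the file name is only a placeholder for whatever the export step wrote.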

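All other image formats must be converted to pixels and then compressed in some way (usually with Flate/ZIP). Such a conversion could preserve the metadata as part of the PDF's XML metadata or of the image dictionary, but I have never even heard of that happening. It simply gets thrown away.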
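To make the Flate point concrete, here is a small round trip with the JDK's `java.util.zip` classes. It is only a sketch of the compression step, assuming plain zlib-wrapped deflate data with no predictor applied; the class name `FlateRoundTrip` is hypothetical. What goes in and what comes out is just the sample bytes, so there is no place in this stream for a date.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class FlateRoundTrip {
    /** Compresses raw sample bytes into zlib-wrapped deflate data. */
    static byte[] deflate(byte[] raw) {
        Deflater deflater = new Deflater();
        deflater.setInput(raw);
        deflater.finish();
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        while (!deflater.finished()) {
            int n = deflater.deflate(buffer);
            out.write(buffer, 0, n);
        }
        deflater.end();
        return out.toByteArray();
    }

    /** Restores the raw sample bytes from zlib-wrapped deflate data. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        inflater.setInput(compressed);
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        byte[] buffer = new byte[4096];
        try {
            while (!inflater.finished()) {
                int n = inflater.inflate(buffer);
                // No progress and nothing left to read means the data is cut off
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported Flate data");
                }
                out.write(buffer, 0, n);
            }
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid Flate data", e);
        } finally {
            inflater.end();
        }
        return out.toByteArray();
    }
}
```

A PDF library normally performs this decoding for you; the sketch is only meant to show that the compressed stream holds pixel samples and nothing else.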

score 1 · Accepted Answer

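When an image is embedded in a PDF, the original creation and modification dates are generally not saved. Only the raw pixel data is compressed and stored. However, according to Wikipedia:

Raster images in PDF (called Image XObjects) are represented by dictionaries with an associated stream.

The dictionary contains metadata, and you may find the dates there.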

score 0 · Accepted Answer

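Use the Snowtide API to get the metadata from the PDF file. With PDFTextStream.jar on the build path, the method below reads all the PDF attributes and prints them to the command line.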

public static void getPDFMetaData(String pdfFilePath) throws IOException {
    // Open the input PDF; add PDFTextStream.jar from the Snowtide web site to your build path
    PDFTextStream stream = new PDFTextStream(pdfFilePath);
    try {
        // Get the collection of all document attribute names
        Set attributeKeys = stream.getAttributeKeys();

        // Print the values of all document attributes to System.out
        Iterator iter = attributeKeys.iterator();
        while (iter.hasNext()) {
            String attrKey = (String) iter.next();
            System.out.println(attrKey + " = " + stream.getAttribute(attrKey));
        }
    } finally {
        // Release the file handle even if reading an attribute fails
        stream.close();
    }
}
