java - Extracting embedded files that are not listed under NAMES

Question

Catalog / AF[0] / EF / F

AF is an array
The first entry is a file specification dictionary
EF is a dictionary
F should be the embedded file stream

Using PDFBox I can get this far:

PDFParser parser = new PDFParser(is);
parser.parse();
PDDocument document = parser.getPDDocument();
PDDocumentCatalog catalog = document.getDocumentCatalog();
PDDocumentNameDictionary namesDictionary = new PDDocumentNameDictionary(catalog);
PDEmbeddedFilesNameTreeNode embeddedFiles = namesDictionary.getEmbeddedFiles();
List<PDNameTreeNode> kids = embeddedFiles.getKids();
PDEmbeddedFilesNameTreeNode node = (PDEmbeddedFilesNameTreeNode) kids.get(0);
COSDictionary cosDictionary = node.getCOSDictionary();
COSArray a = (COSArray) cosDictionary.getDictionaryObject(COSName.NAMES);
COSDictionary d = (COSDictionary) a.getObject(1);
COSDictionary ef = (COSDictionary) d.getDictionaryObject(COSName.EF);
COSDictionary f = (COSDictionary) ef.getDictionaryObject(COSName.F);
System.out.println(f);

Output (formatted for readability):

COSDictionary{(COSName{Length}:COSInt{1433})
              (COSName{Filter}:COSName{FlateDecode})
              (COSName{Type}:COSName{EmbeddedFile})
              (COSName{Subtype}:COSName{text/xml})
              (COSName{Params}:COSDictionary{
                (COSName{Size}:COSInt{12030})
                (COSName{ModDate}:COSString{D:20130628111510+02'00'})
               }
              )
             }

That is what I expected so far. But where are the bytes of this embedded XML file? How can I access them?

score 1 · Accepted Answer

I found a simpler way. The embedded files are listed under KIDS, not under NAMES, which is valid, so this works:

List<PDNameTreeNode> kids = embeddedFiles.getKids();
if (kids != null) {
  for (PDNameTreeNode kid : kids) {
    // getValue returns null if this kid does not contain the name,
    // so keep looking in the remaining kids instead of failing
    PDComplexFileSpecification spec =
      (PDComplexFileSpecification) kid.getValue(ZUGFERD_XML_FILENAME);
    if (spec != null) {
      PDEmbeddedFile file = spec.getEmbeddedFile();
      return file.getByteArray();
    }
  }
}
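To illustrate why the lookup has to descend into the kids: in a PDF name tree, a node carries either name/value pairs (Names) or child nodes (Kids), so a root that only has Kids yields nothing if only its own names are checked. The following is a minimal sketch in plain Java, not PDFBox code. The `Node` class, the `find` method and the file name `invoice.xml` are made up for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical model of a PDF name tree; none of this is PDFBox API. */
final class NameTreeSketch {

    static final class Node {
        // Leaf nodes hold name -> value pairs (the /Names entry).
        final Map<String, String> names = new LinkedHashMap<>();
        // Intermediate nodes hold child nodes (the /Kids entry).
        final List<Node> kids = new ArrayList<>();
    }

    /** Depth-first lookup: check this node's names, then descend into its kids. */
    static String find(Node node, String name) {
        String value = node.names.get(name);
        if (value != null) {
            return value;
        }
        for (Node kid : node.kids) {
            String found = find(kid, name);
            if (found != null) {
                return found;
            }
        }
        return null; // the name is not present anywhere below this node
    }

    public static void main(String[] args) {
        Node leaf = new Node();
        leaf.names.put("invoice.xml", "file spec for invoice.xml");

        Node root = new Node(); // the root has only kids, no names of its own
        root.kids.add(leaf);

        System.out.println(find(root, "invoice.xml"));
        System.out.println(find(root, "missing.xml"));
    }
}
```

Looking only at `root.names` would miss the entry here, which mirrors the situation in the question where the file is reachable only through the kids.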

score 0 · Accepted Answer

Maybe you should take a look at the ExtractEmbeddedFiles example that ships with PDFBox. It shows how to extract all kinds of embedded files.
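As background on where the bytes live: the stream dictionary printed in the question has Filter FlateDecode, with Length 1433 and Params/Size 12030. That is consistent with the stream data being stored zlib-compressed and expanding to the original XML when decoded. PDFBox handles the decoding itself (as far as I know, `getByteArray()` already returns the decoded data), so the sketch below is only an illustration of what FlateDecode amounts to, using `java.util.zip` from the JDK. The class and method names are made up, and the sample XML is a placeholder.

```java
import java.io.ByteArrayOutputStream;
import java.nio.charset.StandardCharsets;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

/** Illustration only: zlib round trip as used by FlateDecode; not PDFBox code. */
final class FlateSketch {

    /** Compresses bytes into zlib format, the form a FlateDecode stream stores. */
    static byte[] deflate(byte[] plain) {
        Deflater deflater = new Deflater();
        try {
            deflater.setInput(plain);
            deflater.finish();
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!deflater.finished()) {
                int n = deflater.deflate(buf);
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } finally {
            deflater.end();
        }
    }

    /** Decompresses zlib data; throws IllegalArgumentException on bad input. */
    static byte[] inflate(byte[] compressed) {
        Inflater inflater = new Inflater();
        try {
            inflater.setInput(compressed);
            ByteArrayOutputStream out = new ByteArrayOutputStream();
            byte[] buf = new byte[1024];
            while (!inflater.finished()) {
                int n = inflater.inflate(buf);
                // No progress and nothing left to read means the data is truncated.
                if (n == 0 && (inflater.needsInput() || inflater.needsDictionary())) {
                    throw new IllegalArgumentException("truncated or unsupported stream");
                }
                out.write(buf, 0, n);
            }
            return out.toByteArray();
        } catch (DataFormatException e) {
            throw new IllegalArgumentException("not valid zlib data", e);
        } finally {
            inflater.end();
        }
    }

    public static void main(String[] args) {
        // Placeholder content standing in for the embedded XML file.
        byte[] xml = "<invoice><item/><item/><item/></invoice>"
                .getBytes(StandardCharsets.UTF_8);
        byte[] stored = deflate(xml);    // what would sit in the stream (Length)
        byte[] decoded = inflate(stored); // what the caller wants (Params/Size)
        System.out.println(stored.length + " stored, " + decoded.length + " decoded");
    }
}
```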
