validation - PdfBox: PDF/A-1A to PDF/A-3A

Question

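I have the following problem: I want to convert a PDF/A-1A document to PDF/A-3A. The original document has been validated with Acrobat Reader Pro, so I can assume it conforms to PDF/A-1A.

I tried to convert the PDF metadata with the following code: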

private PDDocumentCatalog makeA3compliant(PDDocument doc) throws IOException, TransformerException {
  // Attach a fresh XMP metadata stream to the document catalog
  PDDocumentCatalog cat = doc.getDocumentCatalog();
  PDMetadata metadata = new PDMetadata(doc);
  cat.setMetadata(metadata);

  // PDF/A identification schema (part and conformance are set further down)
  XMPMetadata xmp = new XMPMetadata();
  XMPSchemaPDFAId pdfaid = new XMPSchemaPDFAId(xmp);
  xmp.addSchema(pdfaid);

  // Dublin Core schema
  XMPSchemaDublinCore dc = xmp.addDublinCoreSchema();
  String creator = "TestCr";
  String producer = "testPr";
  dc.addCreator(creator);
  dc.setAbout("");

  // XMP Basic schema
  XMPSchemaBasic xsb = xmp.addBasicSchema();
  xsb.setAbout("");
  xsb.setCreatorTool(creator);
  xsb.setCreateDate(GregorianCalendar.getInstance());

  // Document information dictionary, using the same values as the XMP packet
  PDDocumentInformation pdi = new PDDocumentInformation();
  pdi.setProducer(producer);
  pdi.setAuthor(creator);
  doc.setDocumentInformation(pdi);

  // Adobe PDF schema
  XMPSchemaPDF pdf = xmp.addPDFSchema();
  pdf.setProducer(producer);
  pdf.setAbout("");

  // Flag the document as tagged
  PDMarkInfo markinfo = new PDMarkInfo();
  markinfo.setMarked(true);
  doc.getDocumentCatalog().setMarkInfo(markinfo);

  // Declare PDF/A-3A
  pdfaid.setPart(3);
  pdfaid.setConformance("A");
  pdfaid.setAbout("");

  // Write the XMP packet into the metadata stream
  metadata.importXMPMetadata(xmp);

  return cat;
}

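When I validate the new file with Acrobat again, I get a validation error:

CIDset in subset font is incomplete (font contains glyphs that are not listed)

If I validate the file with this online validator (http://www.pdf-tools.com/pdf/validate-pdfa-online.aspx), it is reported as a valid PDF/A-3A...

Am I missing something?

Can nobody help?

EDIT: Here is the PDF file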

score 3 · Accepted Answer

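OK - I think I have an answer to your question, at least from the point of view of callas and/or Adobe technology (again, I'm affiliated with callas, whose pdfToolbox technology is also used inside Acrobat).

According to my research and the people I consulted, your example PDF document contains a font with an incomplete CID set. So why do pdfToolbox and Acrobat say it is a valid PDF/A-1a file but not a valid PDF/A-3a file? Interesting question:

1) The rules for incomplete CID sets changed between PDF/A-1a and PDF/A-3a. They are stricter in PDF/A-3a.

2) However, while a CID set must always be present in PDF/A-1a, in PDF/A-3a a file can be valid and compliant without any CID set at all.

So your PDF file contains a CID set (its mere presence is acceptable for both PDF/A-1a and A-3a), but while that CID set is good enough for A-1a, it does not list all the characters it would need to list to be compliant with A-3a.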
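The rule described above comes down to a bit lookup. As a minimal sketch (not code from this answer, and not a PDFBox or pdfToolbox API): per the PDF specification, a CIDSet stream is a table of bits indexed by CID, stored high-order bit first, so the most significant bit of the first byte stands for CID 0. The class and method names below are hypothetical, and the caller is assumed to have already extracted the decoded CIDSet bytes and the list of CIDs present in the embedded font subset.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper: checks a decoded CIDSet bit table against the CIDs in a font subset. */
final class CidSetCheck {

    /** Returns true if the bit for the given CID is set (bit 7 of byte 0 is CID 0). */
    static boolean isListed(byte[] cidSet, int cid) {
        int byteIndex = cid / 8;
        if (cid < 0 || byteIndex >= cidSet.length) {
            return false; // CIDs outside the table count as not listed
        }
        int mask = 0x80 >> (cid % 8);
        return (cidSet[byteIndex] & mask) != 0;
    }

    /** Returns the CIDs that the font contains but the CIDSet does not list. */
    static List<Integer> missingCids(byte[] cidSet, int[] cidsInFont) {
        List<Integer> missing = new ArrayList<>();
        for (int cid : cidsInFont) {
            if (!isListed(cidSet, cid)) {
                missing.add(cid);
            }
        }
        return missing;
    }

    public static void main(String[] args) {
        // 0xA0 = 1010 0000: CIDs 0 and 2 are listed, everything else is not.
        byte[] cidSet = {(byte) 0xA0};
        int[] cidsInFont = {0, 2, 5};
        System.out.println(missingCids(cidSet, cidsInFont)); // prints [5]
    }
}
```

A non-empty result corresponds to the situation in the Acrobat message: the font contains glyphs that the CIDSet does not list.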

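To test at least part of that theory, I processed your file through pdfToolbox with a fixup named "Remove CIDset if incomplete". That fixup, as the name implies, removes the CID set from the file and changes nothing else. After doing so, your file validates as a valid A-3a file.

That leaves the question of why the pdftools website claims this is a valid PDF/A-3a file. According to the people I talked to, the preflight result for the file is correct and the file should be flagged with an error. So perhaps this is something you need to discuss with the pdftools people (they can probably work out with callas who is ultimately right).

If you want to discuss this further, feel free to send me a personal message; more discussion about the tools themselves would probably be off-topic for this public site.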

score 3 · Accepted Answer

This made us fully PDF/A-3 compliant as far as the CIDSet issue is concerned:

private void removeCidSet(PDDocumentCatalog catalog) {

  COSName cidSet = COSName.getPDFName("CIDSet");

  // iterate over all PDF pages
  for (Object object : catalog.getAllPages()) {
    if (object instanceof PDPage) {

      PDPage page = (PDPage) object;
      PDResources resources = page.getResources();
      if (resources == null) {
        continue; // no page-level resources, so no fonts to inspect here
      }
      Map<String, PDFont> fonts = resources.getFonts();
      Iterator<String> iterator = fonts.keySet().iterator();

      // iterate over all fonts in the page-level resources
      while (iterator.hasNext()) {
        PDFont pdFont = fonts.get(iterator.next());

        // only composite (Type 0) fonts have a CID descendant font
        if (pdFont instanceof PDType0Font) {
          PDType0Font typedFont = (PDType0Font) pdFont;

          if (typedFont.getDescendantFont() instanceof PDCIDFontType2Font) {
            PDCIDFontType2Font f = (PDCIDFontType2Font) typedFont.getDescendantFont();
            PDFontDescriptor fontDescriptor = f.getFontDescriptor();

            // drop the CIDSet entry from the font descriptor dictionary
            if (fontDescriptor instanceof PDFontDescriptorDictionary) {
              PDFontDescriptorDictionary fontDict = (PDFontDescriptorDictionary) fontDescriptor;
              fontDict.getCOSDictionary().removeItem(cidSet);
            }
          }
        }
      }
    }
  }
}
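The method above removes the CIDSet entry. An alternative that is not taken in this answer would be to write a complete table instead. The following is only a sketch of the bit-table part, under the assumption that the caller already knows every CID present in the embedded font subset; the class name `CidSetBuilder` is hypothetical, and storing the resulting bytes as a stream in the font descriptor is deliberately left out. The layout follows the PDF specification: bits are indexed by CID, high-order bit first.

```java
/** Hypothetical helper: builds a CIDSet bit table that lists every given CID. */
final class CidSetBuilder {

    /** Returns a bit table in which the bit for each given CID is set (bit 7 of byte 0 is CID 0). */
    static byte[] build(int[] cids) {
        int maxCid = -1;
        for (int cid : cids) {
            if (cid < 0) {
                throw new IllegalArgumentException("negative CID: " + cid);
            }
            maxCid = Math.max(maxCid, cid);
        }
        // One bit per CID from 0 to maxCid, rounded up to whole bytes (empty input gives 0 bytes).
        byte[] table = new byte[(maxCid + 8) / 8];
        for (int cid : cids) {
            table[cid / 8] |= (byte) (0x80 >> (cid % 8));
        }
        return table;
    }

    public static void main(String[] args) {
        byte[] table = build(new int[] {0, 2, 5});
        // Bits for CIDs 0, 2 and 5: 1010 0100
        System.out.println(String.format("%02X", table[0] & 0xFF)); // prints A4
    }
}
```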
