java - 使用 PDFBox 删除未启用的可选内容组

Question

我正在使用来自 java 的 apache PDFBox，并且我有一个包含多个可选内容组的源 PDF。我想要做的是导出仅包含标准内容和启用的可选内容组的 PDF 版本。对我的目的而言，保留原件的任何动态方面很重要......所以文本字段仍然是文本字段，矢量图像仍然是矢量图像等。需要这样做的原因是因为我打算最终使用一个 pdf 表单编辑器程序，它不知道如何处理可选内容，并且会盲目地渲染所有这些内容，所以我想对源 pdf 进行预处理，并在不太混乱的目标 pdf 上使用表单编辑程序。

我一直在试图找到一些可以给我任何关于如何用谷歌做到这一点的提示，但无济于事。我不知道我是否只是使用了错误的搜索词，或者这是否超出了 PDFBox API 的设计目的。我宁愿希望不是后者。此处显示的信息似乎不起作用（将 C# 代码转换为 java），因为尽管我尝试导入具有可选内容的 pdf，但当我检查每个页面上的令牌时似乎没有任何 OC 资源。

    for(PDPage page:pages) {
        PDResources resources = page.getResources();            
        PDFStreamParser parser = new PDFStreamParser(page);
        parser.parse();
        Collection tokens = parser.getTokens();
        ...
    }

我真的很抱歉没有更多的代码来展示我到目前为止所做的尝试，但我已经研究了大约 8 个小时的 java API 文档，试图弄清楚我可能需要做什么，只是无法弄清楚。

我所知道的是将文本、线条和图像添加到新的PDPage，但我不知道如何从给定的源页面中检索该信息以将其复制过来，也不知道如何区分这些信息是哪个可选内容组是（如果有的话）的一部分。我也不确定如何将源 pdf 中的表单字段复制到目标，也不知道如何复制字体信息。

老实说，如果有一个网页我无法通过我尝试的搜索在谷歌上找到，我会非常乐意阅读更多关于它的信息，但我真的很困在这里，我不不认识任何知道这个图书馆的人。

请帮忙。

编辑：尝试我从下面的建议中理解的内容，我编写了一个循环来检查页面上的每个 XObject，如下所示：

PDResources resources = pdPage.getResources();
Iterable<COSName> names = resources.getXObjectNames();
for(COSName name:names) {
    PDXObject xobj = resources.getXObject(name);
    PDFStreamParser parser = new PDFStreamParser(xobj.getStream().toByteArray());
    parser.parse();
    Object [] tokens = parser.getTokens().toArray();
    for(int i = 0;i<tokens.length-1;i++) {
        Object obj = tokens[i];
        if (obj instanceof COSName && obj.equals(COSName.OC)) {
            i++;
            Object obj = tokens[i];
            if (obj instanceof COSName) {
                PDPropertyList props = resources.getProperties((COSName)obj);
                if (props != null) {
...

但是，在 OC 键之后，tokens数组中的下一个条目始终是Operator标记为“BMC”。我在任何地方都找不到任何可以从命名的可选内容组中识别的信息。

score 2 · Accepted Answer

这是删除标记内容块的强大解决方案（如果有人发现任何不正常的内容，欢迎反馈）。您应该能够针对 OC 块进行调整...

此代码正确处理资源的嵌套和删除（xobject、图形状态和字体 - 如果需要，可以轻松添加其他资源）。

public class MarkedContentRemover {

    private final MarkedContentMatcher matcher;
    
    /**
     * 
     */
    public MarkedContentRemover(MarkedContentMatcher matcher) {
        this.matcher = matcher;
    }
    
    public int removeMarkedContent(PDDocument doc, PDPage page) throws IOException {
        ResourceSuppressionTracker resourceSuppressionTracker = new ResourceSuppressionTracker();
        
        PDResources pdResources = page.getResources();

        PDFStreamParser pdParser = new PDFStreamParser(page);
        
        
        PDStream newContents = new PDStream(doc);
        OutputStream newContentOutput = newContents.createOutputStream(COSName.FLATE_DECODE);
        ContentStreamWriter newContentWriter = new ContentStreamWriter(newContentOutput);
        
        List<Object> operands = new ArrayList<>();
        Operator operator = null;
        Object token;
        int suppressDepth = 0;
        boolean resumeOutputOnNextOperator = false;
        int removedCount = 0;
        
        while (true) {

            operands.clear();
            token = pdParser.parseNextToken();
            while(token != null && !(token instanceof Operator)) {
                operands.add(token);
                token = pdParser.parseNextToken();
            }
            operator = (Operator)token;
            
            if (operator == null) break;
            
            if (resumeOutputOnNextOperator) {
                resumeOutputOnNextOperator = false;
                suppressDepth--;
                if (suppressDepth == 0)
                    removedCount++;
            }
            
            if (OperatorName.BEGIN_MARKED_CONTENT_SEQ.equals(operator.getName())
                    || OperatorName.BEGIN_MARKED_CONTENT.equals(operator.getName())) {
                
                COSName contentId = (COSName)operands.get(0);

                final COSDictionary properties;
                if (operands.size() > 1) {
                    Object propsOperand = operands.get(1);
                    
                    if (propsOperand instanceof COSDictionary) {
                        properties = (COSDictionary) propsOperand;
    
                    } else if (propsOperand instanceof COSName) {
                        properties = pdResources.getProperties((COSName)propsOperand).getCOSObject();
                    } else {
                        properties = new COSDictionary();
                    }
                } else {
                    properties = new COSDictionary();
                }
                
                if (matcher.matches(contentId, properties)) {
                    suppressDepth++;
                }
                
            }
        
            if (OperatorName.END_MARKED_CONTENT.equals(operator.getName())) {
                if (suppressDepth > 0)
                    resumeOutputOnNextOperator = true;
            }

            else if (OperatorName.SET_GRAPHICS_STATE_PARAMS.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.EXT_G_STATE, operands.get(0), suppressDepth == 0);
            }

            else if (OperatorName.DRAW_OBJECT.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.XOBJECT, operands.get(0), suppressDepth == 0);
            }
            
            else if (OperatorName.SET_FONT_AND_SIZE.equals(operator.getName())) {
                resourceSuppressionTracker.markForOperator(COSName.FONT, operands.get(0), suppressDepth == 0);
            }
            
            

            if (suppressDepth == 0) {
                newContentWriter.writeTokens(operands);
                newContentWriter.writeTokens(operator);
            }

        }
        
        if (resumeOutputOnNextOperator)
            removedCount++;

        

        newContentOutput.close();

        page.setContents(newContents);
        
        resourceSuppressionTracker.updateResources(pdResources);
        
        return removedCount;
    }

    
    private static class ResourceSuppressionTracker{
        // if the boolean is TRUE, then the resource should be removed.  If the boolean is FALSE, the resource should not be removed
        private final Map<COSName, Map<COSName, Boolean>> tracker = new HashMap<>();
        
        public void markForOperator(COSName resourceType, Object resourceNameOperand, boolean preserve) {
            if (!(resourceNameOperand instanceof COSName)) return;
            if (preserve) {
                markForPreservation(resourceType, (COSName)resourceNameOperand);
            } else {
                markForRemoval(resourceType, (COSName)resourceNameOperand);
            }
        }
        
        public void markForRemoval(COSName resourceType, COSName refId) {
            if (!resourceIsPreserved(resourceType, refId)) {
                getResourceTracker(resourceType).put(refId, Boolean.TRUE);
            }
        }

        public void markForPreservation(COSName resourceType, COSName refId) {
            getResourceTracker(resourceType).put(refId, Boolean.FALSE);
        }
        
        public void updateResources(PDResources pdResources) {
            for (Map.Entry<COSName, Map<COSName, Boolean>> resourceEntry : tracker.entrySet()) {
                for(Map.Entry<COSName, Boolean> refEntry : resourceEntry.getValue().entrySet()) {
                    if (refEntry.getValue().equals(Boolean.TRUE)) {
                        pdResources.getCOSObject().getCOSDictionary(COSName.XOBJECT).removeItem(refEntry.getKey());
                    }
                }
            }
        }
        
        private boolean resourceIsPreserved(COSName resourceType, COSName refId) {
            return getResourceTracker(resourceType).getOrDefault(refId, Boolean.FALSE);
        }
        
        private Map<COSName, Boolean> getResourceTracker(COSName resourceType){
            if (!tracker.containsKey(resourceType)) {
                tracker.put(resourceType, new HashMap<>());
            }
            
            return tracker.get(resourceType);
            
        }
    }
    
}

助手类：

public interface MarkedContentMatcher {
    public boolean matches(COSName contentId, COSDictionary props);
}

score 0 · Accepted Answer

可选内容组标有 BDC 和 EMC。您必须浏览从解析器返回的所有标记并从数组中删除“部分”。这是前段时间发布的一些 C# 代码 - [1]：如何使用 pdfbox 从 pdf 中删除可选内容组及其内容？

我对此进行了调查（转换为 Java），但无法按预期工作。我设法删除了 BDC 和 EMC 之间的内容，然后使用与示例相同的技术保存了结果，但 PDF 已损坏。也许那是我缺乏 C# 知识（与元组等有关）

这是我想出的，正如我所说，它不起作用也许你或其他人（mkl，Tilman Hausherr）可以发现这个缺陷。

    OCGDelete (PDDocument doc, int pageNum, String OCName) {
      PDPage pdPage = (PDPage) doc.getDocumentCatalog().getPages().get(pageNum);
      PDResources pdResources = pdPage.getResources();
      PDFStreamParser pdParser = new PDFStreamParser(pdPage);

      int ocgStart
      int ocgLength

      Collection tokens = pdParser.getTokens();
      Object[] newTokens = tokens.toArray()

      try {
        for (int index = 0; index < newTokens.length; index++) {
            obj = newTokens[index]
            if (obj instanceof COSName && obj.equals(COSName.OC)) {
                // println "Found COSName at "+index   /// Found Optional Content
                startIndex = index
                index++
                if (index < newTokens.size()) {
                    obj = newTokens[index]
                    if (obj instanceof COSName) {
                        prop = pdRes.getProperties(obj)
                        if (prop != null && prop instanceof PDOptionalContentGroup) {
                            if ((prop.getName()).equals(delLayer)) {
                                println "Found the Layer to be deleted"
                                println "prop Name was " + prop.getName()

                                index++

                                if (index < newTokens.size()) {
                                    obj = newTokens[index]

                                    if ((obj.getName()).equals("BDC")) {
                                        ocgStart = index
                                        println("OCG Start " + ocgStart)
                                        ocgLength = -1
                                        index++

                                        while (index < newTokens.size()) {
                                            ocgLength++
                                            obj = newTokens[index]
                                            println " Loop through relevant OCG Tokens " + obj
                                            if (obj instanceof Operator && (obj.getName()).equals("EMC")) {

                                                println "the next obj was " + obj
                                                println "after that " + newTokens[index + 1] + "and then " + newTokens[index + 2]
                                                println("OCG End " + ocgLength++)
                                                break

                                            }

                                            index++
                                        }
                                        if (endIndex > 0) {
                                            println "End Index was something " + (startIndex + ocgLength)

                                        }
                                    }
                                }
                            }
                        }
                    }
                }
            }
        }
    }
    catch (Exception ex){
        println ex.message()
    }

    for (int i = ocgStart; i < ocgStart+ ocgLength; i++){
        newTokens.removeAt(i)
    }


    PDStream newContents = new PDStream(doc);
    OutputStream output = newContents.createOutputStream(COSName.FLATE_DECODE);
    ContentStreamWriter writer = new ContentStreamWriter(output);
    writer.writeTokens(newTokens);
    output.close();
    pdPage.setContents(newContents);

  }

java - 使用 PDFBox 删除未启用的可选内容组

2 回答 2

Related

Reference