perl - How do I delete all images from a PDF using CAM::PDF without corrupting it?

Question

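The script below deletes all images from a PDF file using CAM::PDF. However, the output is corrupted. PDF readers can still open it, but they complain about errors. For example, mupdf says: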

error: no XObject subtype specified
error: cannot draw xobject/image
warning: Ignoring errors during rendering
mupdf: warning: Errors found on page

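The CAM::PDF page on CPAN lists the deleteObject() method under "Deeper utilities", which presumably means it is not intended for public use. It also warns:

This function does not handle dependencies on this object.

My question is: what is the correct way to delete an object from a PDF file using CAM::PDF? If the problem is related to dependencies, how can I delete an object while taking care of its dependencies?

For ways to remove images from a PDF with other tools, see the related question.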

use CAM::PDF;
my $pdf = CAM::PDF->new ( shift ) or die $CAM::PDF::errstr;

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    )
    {
      $pdf->deleteObject ( $objnum );
    }
  }
}

$pdf->cleanoutput ( '-' );

score 4 · Accepted Answer

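This uses CAM::PDF, but takes a slightly different approach. Rather than trying to delete the images, which is quite hard, it replaces each image with a transparent one.

First, note that we can use ImageMagick to generate a blank PDF that contains nothing but a transparent image: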

% convert  -size 200x100 xc:none transparent.pdf

If we view the generated PDF in a text editor, we can find the main image object:

8 0 obj
<<
/Type /XObject
/Subtype /Image
/Name /Im0
...

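The important thing to note here is that the transparent image was generated as object number 8.

It is then a matter of importing this object and using it to replace each real image in the PDF, effectively blanking them out.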

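use strict;
use warnings;
use CAM::PDF;

# PDF to process, given as the first command-line argument
my $pdf = CAM::PDF->new ( shift ) or die "$CAM::PDF::errstr\n";

# PDF holding the transparent replacement image generated above
my $trans_pdf = CAM::PDF->new ( 'transparent.pdf' ) or die "$CAM::PDF::errstr\n";
my $trans_objnum = 8; # object number of the transparent image in transparent.pdf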

foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
  my $xobj = $pdf->dereference ( $objnum );

  if ( $xobj->{value}->{type} eq 'dictionary' ) {
    my $im = $xobj->{value}->{value};
    if
    (
      defined $im->{Type} and defined $im->{Subtype}
      and $pdf->getValue ( $im->{Type}    ) eq 'XObject'
      and $pdf->getValue ( $im->{Subtype} ) eq 'Image'
    ) {
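        # overwrite the image in place; the final 1 asks replaceObject
        # to also copy any objects the imported image refers to
        $pdf->replaceObject ( $objnum, $trans_pdf, $trans_objnum, 1 );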
    }
  }
}

$pdf->cleanoutput ( '-' );

The script now replaces each image in the PDF with the imported transparent image object (object number 8 from transparent.pdf).
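The object number 8 is what this particular ImageMagick run produced; it seems reasonable to assume that other versions or options could number the image differently. As a rough sketch, the number could be looked up instead of hardcoded, using the same xref, dereference and getValue calls as the script above. The helper name find_image_objnums is made up for this illustration and is not part of CAM::PDF.

```perl
use strict;
use warnings;

# Return the object numbers of all image XObjects in a document.
# $pdf is expected to be a CAM::PDF object; only {xref}, dereference()
# and getValue() are used, as in the scripts in this thread.
# find_image_objnums is a hypothetical helper, not a CAM::PDF method.
sub find_image_objnums {
    my ($pdf) = @_;
    my @found;
    foreach my $objnum ( sort { $a <=> $b } keys %{ $pdf->{xref} } ) {
        my $obj = $pdf->dereference($objnum) or next;
        next unless $obj->{value}->{type} eq 'dictionary';
        my $dict = $obj->{value}->{value};
        next unless defined $dict->{Type} and defined $dict->{Subtype};
        push @found, $objnum
            if  $pdf->getValue( $dict->{Type} )    eq 'XObject'
            and $pdf->getValue( $dict->{Subtype} ) eq 'Image';
    }
    return @found;
}

# Only load CAM::PDF when a file name is given on the command line.
if (@ARGV) {
    require CAM::PDF;
    my $trans_pdf = CAM::PDF->new( $ARGV[0] )
        or die "cannot read $ARGV[0]\n";
    my @images = find_image_objnums($trans_pdf);
    die "expected exactly one image object\n" unless @images == 1;
    print "transparent image is object $images[0]\n";
}
```

Run against transparent.pdf, this should report the number to assign to $trans_objnum. If the file contains more than one image object, the sketch stops rather than guessing.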

score 2 · Accepted Answer

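Another approach, which really deletes the images, is to:

find and delete the image XObjects in the resource lists,
keep an array with the names of the deleted resources,
replace the corresponding Do operators in each page's content with whitespace of the same length,
clean up and print.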
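The third step can be tried in isolation on a plain string, without touching a PDF. The sketch below applies the same kind of substitution to a made-up content stream; the helper name blank_do_operators and the sample content are assumptions for illustration only.

```perl
use strict;
use warnings;

# Replace every "/Name Do" operator for the given resource names with
# spaces of the same length, so the length of the content stays the same.
# blank_do_operators is a hypothetical helper used only for this example.
sub blank_do_operators {
    my ( $content, @names ) = @_;
    my $count = 0;
    foreach my $name (@names) {
        # s///g returns the number of substitutions, or false if none
        $count += ( $content =~ s{( / \Q$name\E \s+ Do \b )}{ ' ' x length $1 }xeg ) || 0;
    }
    return ( $content, $count );
}

# Made-up page content: one image drawn with Do, followed by some text.
my $before = "q 200 0 0 100 0 0 cm /Im0 Do Q\nBT /F1 12 Tf (text) Tj ET\n";
my ( $after, $count ) = blank_do_operators( $before, 'Im0' );

print "blanked $count operator(s)\n";
print length($before) == length($after) ? "length unchanged\n" : "length changed\n";
print $after;
```

Because the pattern requires whitespace right after the name, a name such as Im0 does not match inside /Im10 Do.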

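Note that dwarring's approach is safer because it does not need to call $doc->cleanse at the end. According to the CAM::PDF documentation, the cleanse method:

Remove unused objects. WARNING: this function breaks some PDF documents because it removes objects that are strictly part of the page model hierarchy, but which are required anyway (like some font definition objects).

I don't know how much of a problem using cleanse will be in practice.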

use strict;
use warnings;
use CAM::PDF;
my $doc = CAM::PDF->new ( shift ) or die $CAM::PDF::errstr;

# delete image XObjects among resources
# but keep their names

my @names;

foreach my $objnum ( sort { $a <=> $b } keys %{ $doc->{xref} } ) {
  my $obj = $doc->dereference( $objnum );
  next unless $obj->{value}->{type} eq 'dictionary';

  my $n = $obj->{value}->{value};

  my $resources = $doc->getValue ( $n->{Resources}       ) or next;
  my $resource  = $doc->getValue ( $resources->{XObject} ) or next;

  # keys on a bare hash reference is an error in modern Perl; dereference it
  foreach my $name ( sort keys %$resource ) {
    my $im = $doc->getValue ( $resource->{$name} ) or next;

    next unless defined $im->{Type}
            and defined $im->{Subtype}
            and $doc->getValue ( $im->{Type}    ) eq 'XObject'
            and $doc->getValue ( $im->{Subtype} ) eq 'Image';

    delete $resource->{$name};
    push @names, $name;
  }
}

# delete the corresponding Do operators

if ( @names ) {
  foreach my $p ( 1 .. $doc->numPages ) {
    my $content = $doc->getPageContent ( $p );
    my $s;
    foreach my $name ( @names ) {
      ++$s if $content =~ s{( / \Q$name\E \s+ Do \b )} { ' ' x length $1 }xeg;
    }
    $doc->setPageContent ( $p, $content ) if $s;
  }
}

$doc->cleanse;
$doc->cleanoutput;
