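I would like to be able to detect a pattern in a PDF and flag it somehow.
For example, in this PDF there is the string *2. I would like to be able to parse the PDF, detect all instances of *[integer], and do something to call attention to the matches (like highlighting them in yellow or adding a symbol in the margin).
I would prefer to do this in Python, but I'm open to other languages. So far, I've been able to use pyPdf to read the PDF's text, and I can use a regular expression to detect the pattern. But I haven't been able to figure out how to flag the matches and re-save the PDF.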
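Either people aren't interested, or Python isn't capable, so here's a solution in Perl :-). Seriously, as noted above, you don't need to "change the string". PDF annotations are the solution for you. I had a small project with annotations not long ago, and some of the code is from there. However, my content parser wasn't universal, and you don't need full-blown parsing anyway (that is, being able to alter the content and write it back), so I resorted to an external tool (mudraw from muPDF-tools). The PDF library I use (CAM::PDF) is somewhat low-level, but I don't mind. It also means that one is expected to have proper knowledge of PDF internals to understand what's going on. Otherwise, just use the tool.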
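To make "PDF annotations" a bit more concrete, here is a minimal Python sketch of what a highlight annotation dictionary looks like in raw PDF syntax. The function name `highlight_annotation` is hypothetical and only for illustration. It formats the dictionary as text and does not write anything into a PDF; a real tool still needs a PDF library to attach it to the page's /Annots array. The corner order in /QuadPoints follows the one used by the Perl script below (top-left, top-right, bottom-left, bottom-right), which I am assuming is what common viewers expect.

```python
def highlight_annotation(x1, y1, x2, y2, color=(1, 1, 0)):
    """Return a /Highlight annotation dictionary as raw PDF syntax.

    Coordinates are in PDF user space (origin at the bottom-left of the page).
    This only builds the text of the dictionary; it does not modify any file.
    """
    def nums(values):
        # PDF numbers separated by spaces, one decimal place each
        return " ".join(f"{v:.1f}" for v in values)

    rect = nums((x1, y1, x2, y2))
    # Corner order mirrors the Perl script: top-left, top-right,
    # bottom-left, bottom-right (an assumption about viewer behaviour).
    quad = nums((x1, y2, x2, y2, x1, y1, x2, y1))
    return (
        "<< /Type /Annot /Subtype /Highlight "
        f"/Rect [{rect}] /QuadPoints [{quad}] "
        f"/C [{nums(color)}] >>"
    )


if __name__ == "__main__":
    print(highlight_annotation(10, 20, 110, 32))
```

The default colour (1, 1, 0) is yellow in RGB, the same value the script puts in the /C entry.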
For example, this command marks all gerunds in the OP's file:
perl pdf_hl.pl -f westlaw.pdf -p '\S*ing'
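For readers who think in Python, this is roughly what the -p, -c and -w options amount to. It is a sketch under assumptions, not a port: `build_pattern` is a hypothetical name, and I am assuming that Regexp::Assemble's `anchor_word` wraps the combined pattern in `\b` on both sides. Like the Perl script, it splits the list on commas, so a pattern that itself contains a comma (such as `\d{1,3}`) would be broken apart.

```python
import re


def build_pattern(csv, whole_words=True, case_sensitive=False):
    """Combine comma-separated patterns into one compiled regex.

    Defaults mirror the script: whole words only, case-insensitive.
    """
    # Each pattern becomes one alternative in a non-capturing group.
    alternatives = "|".join(f"(?:{p})" for p in csv.split(","))
    body = f"(?:{alternatives})"
    if whole_words:
        body = rf"\b{body}\b"
    return re.compile(body, 0 if case_sensitive else re.IGNORECASE)


if __name__ == "__main__":
    gerunds = build_pattern(r"\S*ing")
    print(gerunds.findall("Parsing and marking strings"))
```

With the whole-word anchors in place, "strings" is not matched, because the "ing" inside it does not end at a word boundary.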
The Perl code (the comments inside are worth reading, too):
use strict;
use warnings;
use XML::Simple;
use CAM::PDF;
use Getopt::Long;
use Regexp::Assemble;
#####################################################################
#
# This is a PDF highlight mark-up tool.
# Though fully functional, it's still a prototype proof-of-concept.
# Please don't feed it with non-pdf files or patterns like '\d*'
# (because you probably want '\d+', don't you?).
#
# Requires muPDF-tools installed and in the PATH, plus some CPAN modules.
#
# ToDo:
# - error handling is primitive if any.
# - cropped files (CropBox) are processed incorrectly. Fix it.
# - of course there can be other useful parameters.
# - allow loading them from file.
# - allow searching across lines (e.g. for multi-word patterns)
# and certainly across "spans" within a line (see mudraw output).
# - multi-color mark-up, not just yellow.
# - control over output file name.
# - compress output (use cleanoutput method instead of output,
# plus more robust (think compressed object streams) compressors
# may be useful).
# - file list processing.
# - annotations are not just colorful marks on the page, their
# dictionaries can contain all sorts of useful information, which may
# be extracted automatically further up the food chain i.e. by
# whoever consumes these files (date, time, author, comments, actual
# text below, etc., etc.), plus think of customized appearance streams,
# placing them on layers, etc.
# - ???
#
# Most complexity in the code comes from adding appearance
# dictionary (AP). You can safely delete it, because most viewers don't
# need AP for standard annotations. Ironically, muPDF-viewer wants it
# (otherwise highlight placement is not 100% correct), and since I relied
# on muPDF-tools, I thought it would be proper to create PDFs consumable by
# their viewer... Firefox wants AP too, btw.
#
#####################################################################
my ($file, $csv);
my ($c_flag, $w_flag) = (0, 1);

GetOptions('-f=s' => \$file, '-p=s' => \$csv,
           '-c!' => \$c_flag, '-w!' => \$w_flag)
    and defined($file)
    and defined($csv)
        or die "\nUsage: perl $0 -f FILE -p LIST -c -w\n\n",
               "\t-f\t\tFILE\t PDF file to annotate\n",
               "\t-p\t\tLIST\t comma-separated patterns\n",
               "\t-c or -noc\t\t be case sensitive (default = no)\n",
               "\t-w or -now\t\t whole words only (default = yes)\n";

my $re = Regexp::Assemble->new
    ->add(split(',', $csv))
    ->anchor_word($w_flag)
    ->flags($c_flag ? '' : 'i')
    ->re;
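# Let mudraw (muPDF-tools) dump the text with per-character bounding boxes
# as XML, then load both the XML tree and the PDF itself.
my $xml = qx/mudraw -ttt $file/;
my $tree = XMLin($xml, ForceArray => [qw/page block line span char/]);
my $pdf = CAM::PDF->new($file);

# Turn a list of numbers into CAM::PDF number nodes with fixed precision.
sub __num_nodes_list {
    my $precision = shift;
    [ map {CAM::PDF::Node->new('number', sprintf("%.${precision}f", $_))} @_ ]
}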
sub add_highlight {
    my ($idx, $x1, $y1, $x2, $y2) = @_;
    my $p = $pdf->getPage($idx);

    # mirror vertically to get to normal cartesian plane
    my ($X1, $Y1, $X2, $Y2) = $pdf->getPageDimensions($idx);
    ($x1, $y1, $x2, $y2) = ($X1 + $x1, $Y2 - $y2, $X1 + $x2, $Y2 - $y1);

    # corner radius
    my $r = 2;

    # AP appearance stream
    my $s = "/GS0 gs 1 1 0 rg 1 1 0 RG\n";
    $s .= "1 j @{[sprintf '%.0f', $r * 2]} w\n";
    $s .= "0 0 @{[sprintf '%.1f', $x2 - $x1]} ";
    $s .= "@{[sprintf '%.1f',$y2 - $y1]} re B\n";

    my $highlight = CAM::PDF::Node->new('dictionary', {
        Subtype => CAM::PDF::Node->new('label', 'Highlight'),
        Rect => CAM::PDF::Node->new('array',
            __num_nodes_list(1, $x1 - $r, $y1 - $r, $x2 + $r * 2, $y2 + $r * 2)),
        QuadPoints => CAM::PDF::Node->new('array',
            __num_nodes_list(1, $x1, $y2, $x2, $y2, $x1, $y1, $x2, $y1)),
        BS => CAM::PDF::Node->new('dictionary', {
            S => CAM::PDF::Node->new('label', 'S'),
            W => CAM::PDF::Node->new('number', 0),
        }),
        Border => CAM::PDF::Node->new('array',
            __num_nodes_list(0, 0, 0, 0)),
        C => CAM::PDF::Node->new('array',
            __num_nodes_list(0, 1, 1, 0)),
        AP => CAM::PDF::Node->new('dictionary', {
            N => CAM::PDF::Node->new('reference',
                $pdf->appendObject(undef,
                    CAM::PDF::Node->new('object',
                        CAM::PDF::Node->new('dictionary', {
                            Subtype => CAM::PDF::Node->new('label', 'Form'),
                            BBox => CAM::PDF::Node->new('array',
                                __num_nodes_list(1, -$r, -$r, $x2 - $x1 + $r * 2,
                                    $y2 - $y1 + $r * 2)),
                            Resources => CAM::PDF::Node->new('dictionary', {
                                ExtGState => CAM::PDF::Node->new('dictionary', {
                                    GS0 => CAM::PDF::Node->new('dictionary', {
                                        BM => CAM::PDF::Node->new('label',
                                            'Multiply'),
                                    }),
                                }),
                            }),
                            StreamData => CAM::PDF::Node->new('stream', $s),
                            Length => CAM::PDF::Node->new('number', length $s),
                        }),
                    ),
                ,0),
            ),
        }),
    });

    $p->{Annots} ||= CAM::PDF::Node->new('array', []);
    push @{$pdf->getValue($p->{Annots})}, $highlight;

    $pdf->{changes}->{$p->{Type}->{objnum}} = 1
}
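# Walk the mudraw tree: page -> block -> line -> span -> char.
my $page_index = 1;
for my $page (@{$tree->{page}}) {
    for my $block (@{$page->{block}}) {
        for my $line (@{$block->{line}}) {
            for my $span (@{$line->{span}}) {
                my $string = join '', map {$_->{c}} @{$span->{char}};
                while ($string =~ /$re/g) {
                    # $-[0] and $+[0] are the start and end offsets of the
                    # match; take the box of its first and last character.
                    my ($x1, $y1) =
                        split ' ', $span->{char}->[$-[0]]->{bbox};
                    my (undef, undef, $x2, $y2) =
                        split ' ', $span->{char}->[$+[0] - 1]->{bbox};
                    add_highlight($page_index, $x1, $y1, $x2, $y2)
                }
            }
        }
    }
    $page_index ++
}

# Write next to the input, e.g. westlaw.pdf -> westlaw++.pdf
# (the /r flag on s/// needs Perl 5.14 or newer).
$pdf->output($file =~ s/(.{4}$)/++$1/r);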
__END__
P.S. I tagged this question with "Perl" to maybe get some feedback (code corrections, etc.) from the community.
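Since the question asked about Python, here is a rough sketch of the two pieces of geometry the script relies on: mapping a regex match back to the boxes of its first and last character, and mirroring the result into PDF coordinates. Both function names and the input layout (a list of `(character, bbox)` pairs with a top-left origin, as the script treats the mudraw output) are assumptions for illustration. It does not extract text from a PDF or write annotations; those steps still need a PDF library.

```python
import re


def match_boxes(chars, pattern):
    """Return one (x1, y1, x2, y2) box per regex match in a span.

    chars: list of (character, (x1, y1, x2, y2)) in reading order,
    with y growing downwards (top-left origin).
    """
    text = "".join(c for c, _ in chars)
    boxes = []
    for m in re.finditer(pattern, text):
        if m.end() == m.start():
            continue  # an empty match has no characters to highlight
        first = chars[m.start()][1]
        last = chars[m.end() - 1][1]
        # left/top edge of the first character, right/bottom of the last
        boxes.append((first[0], first[1], last[2], last[3]))
    return boxes


def flip_to_pdf_space(box, page_box):
    """Mirror a top-left-origin box into PDF space (bottom-left origin).

    page_box is the page rectangle (X1, Y1, X2, Y2) in PDF units.
    """
    x1, y1, x2, y2 = box
    X1, _, _, Y2 = page_box
    return (X1 + x1, Y2 - y2, X1 + x2, Y2 - y1)


if __name__ == "__main__":
    span = [("a", (0, 0, 5, 10)), ("*", (5, 0, 10, 10)), ("2", (10, 0, 15, 10))]
    for box in match_boxes(span, r"\*\d+"):
        print(flip_to_pdf_space(box, (0, 0, 612, 792)))
```

The pattern `\*\d+` is the *[integer] case from the question. Like the Perl script, this only handles matches within a single span.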