I am trying to parse the following file in Perl using CAM::PDF:

http://www.roehampton.ac.uk/uploadedFiles/Pages_Assets/PDFs_and_Word_Docs/Human_Resources/VL%20Advert%20Biomedical%20Sciences%20Sep%2012.pdf

But when I open the PDF, I seem to get a lot of line breaks. Here is a snapshot of my sample code.

use strict;
use warnings;
use CAM::PDF;

my $file_name = 'file_3.pdf';
my $filecontent;
my @lines;
my $save = "/home/tejas/Projects/Richmond/pdf/";
$file_name = $save . $file_name;
my $doc = CAM::PDF->new($file_name) || die "$CAM::PDF::errstr\n";

# Print the extracted text of every page
foreach my $p ( 1 .. $doc->numPages() ) {
    my $str = $doc->getPageText($p);
    if (defined $str) {
        CAM::PDF->asciify(\$str);
        print $str;
    }
}

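I downloaded the PDF from the link above and saved it as file_3.pdf. Please let me know whether there is a better way to parse it so that some of the lines are joined together (especially those that break in the middle of a word).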

1 Answer

I ran this little script:

$ perl -MCAM::PDF -Mstrict - ~/Downloads/perldata.pdf 
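my $doc = CAM::PDF->new($ARGV[0]) or die "$CAM::PDF::errstr\n";
my $str = $doc->getPageText(1);
CAM::PDF->asciify(\$str);
# Two or more consecutive blank lines separate blocks (paragraphs)
my @blocks = split /\n\s*\n\s*\n/, $str;
foreach (@blocks) {
  # Replace each line break, plus surrounding whitespace, with one space
  s/\s*\n\s*/ /g;
  print $_, "\n\n";
}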
__END__

I split the text of the first page into blocks (paragraphs) wherever there are consecutive empty lines. Then I remove every newline, together with its surrounding whitespace, inside each block. Replacing each newline with a space (as I did above) leaves stray spaces inside some words. Using the regex s/\n//g instead runs some words together where there should be a space. Either result is still quite readable, so try both.
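To compare the two joining strategies without needing a PDF at hand, here is a minimal sketch. The `reflow` helper and the sample string are made up for illustration and are not part of CAM::PDF; the sample only imitates the kind of text that `getPageText` returned for this document.

```perl
use strict;
use warnings;

# Hypothetical helper (not part of CAM::PDF): split text into blocks at
# consecutive blank lines, then replace each in-block line break with $joiner.
sub reflow {
    my ($text, $joiner) = @_;
    my @paragraphs;
    for my $block (split /\n\s*\n\s*\n/, $text) {
        $block =~ s/\s*\n\s*/$joiner/g;   # collapse line breaks inside the block
        $block =~ s/^\s+|\s+$//g;         # trim leading and trailing whitespace
        push @paragraphs, $block if length $block;
    }
    return @paragraphs;
}

# Made-up sample imitating extracted PDF text (an assumption, not real output).
my $sample = "Visiting Lecturer\ns in B\niomedical\n\n\nThe popularity\nof the course";

# Joining with a space leaves stray spaces inside words:
print join(' | ', reflow($sample, ' ')), "\n";
# Visiting Lecturer s in B iomedical | The popularity of the course

# Joining with nothing repairs broken words but glues separate words together:
print join(' | ', reflow($sample, '')), "\n";
# Visiting Lecturers in Biomedical | The popularityof the course
```

Neither joiner is right for every line break, which is why the choice probably has to be made per document, or refined with extra heuristics.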

There is no easy way to get an ideal solution. Keep in mind that the PDF format is all about the graphical representation of a document, not about its semantics.

The first few lines look like this:

Department of Life Sciences

Visiting Lecturer s (1.5 FTE) in B iomedical S cience s

The popularity [...]
Answered 2012-09-11T01:36:37.250