I'm sure this is really basic, but I don't know Perl and only need to use it this once, so thank you for your patience.

I am trying to strip unwanted text from individual lines of HTML like this one:

    <a target="_blank"          href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a> 

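What I want to be left with is Run Printable TCI List (<i>Revised</i>), the text just before the </a>. I have around 500 lines like this, and since they may change in future it makes sense to write a program. Here is my Perl code so far: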

open (SEARK, 'C:\\HTMLsorter\\sources.txt');
open (OUTSEARK, '>C:\\HTMLsorter\\outseark.txt');
while(<SEARK>) {
  chomp;

  if ($_=~/<a target/) {
    $_ =~ s/\<i>//g;
    $_ =~ s/\<\/i>//g;
    @itemsa = split(/>/);
    @itemsb = split(/</, $itemsa[1]);
    print OUTSEARK ("$itemsb[0]\n");
  }
}
close (SEARK);
close (OUTSEARK);

I'm sure you can read this, but just to explain: I open a file called sources.txt, which contains the 500 lines to be sorted. The output file is outseark.txt. So far, it outputs:

Run Printable TCI List (Revised)

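This is obviously because the substitutions and splits act on everything in and around the angle brackets. Any ideas on how to keep the italics inside the parentheses? That would leave: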

Run Printable TCI List (<i>Revised<i>)

Thanks for looking.

3 Answers

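#!/usr/bin/perl
use strict;
use warnings;

# Three-argument open with lexical filehandles; die if a file cannot be opened.
open my $ifh, '<', 'myfile.txt' or die "Cannot open myfile.txt: $!";
open my $ofh, '>', 'output.txt' or die "Cannot open output.txt: $!";

while (<$ifh>) {
  # Capture everything between the opening <a target ...> tag and </a>.
  if (/<a\s+target.*?>(.*?)<\/a>/i)
  {
    $_ = $1;
    # Strips every nested tag, <i> included; remove this line to keep the italics.
    s/<.*?>//g;
    print $ofh "$_\n";
  }
}

close $ifh;
close $ofh;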
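
Since the question asks to keep the <i> tags, here is a minimal sketch of the same capture without the tag-stripping step. The helper name extract_link_body and the example.invalid URL are made up for illustration; only the regex idea comes from the code above.

```perl
use strict;
use warnings;

# Hypothetical helper: returns the inner HTML of the first <a target=...> link
# on a line, or undef if the line contains no such link.
sub extract_link_body {
    my ($line) = @_;
    return $line =~ m{<a\s+target[^>]*>(.*?)</a>}i ? $1 : undef;
}

# Shortened stand-in for one line of sources.txt (the URL is a placeholder).
my $sample = '<a target="_blank" href="http://example.invalid/doc.pdf">'
           . 'Run Printable TCI List (<i>Revised</i>)</a>';

my $body = extract_link_body($sample);
print "$body\n" if defined $body;   # prints: Run Printable TCI List (<i>Revised</i>)
```

Because nothing is substituted after the capture, any markup inside the link text is passed through unchanged.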
answered 2012-04-11T13:45:57.270

You could do this in a one-liner.

perl -ne 'if (s#<a\s+target[^>]+>(.+?)</a>##is){print "$1\n";}' inputfile > outputfile

It works:

echo '<a target="_blank"          href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a>
<a target="_blank"          href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List 1(<i>Revised<i>)</a>
<a target="_blank"          href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List 2(<i>Revised<i>)</a>
<a target="_blank"          href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List 3(<i>Revised<i>)</a>'|\
perl -ne 'if (s#<a\s+target[^>]+>(.+?)</a>##is){print "$1\n";}'

Run Printable TCI List (<i>Revised<i>)
Run Printable TCI List 1(<i>Revised<i>)
Run Printable TCI List 2(<i>Revised<i>)
Run Printable TCI List 3(<i>Revised<i>)
answered 2012-04-11T12:25:56.337

You should use a proper HTML parser, such as HTML::TreeBuilder. The code is no more complicated, as this program demonstrates

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file(*DATA);

print $_->as_text, "\n" for $tree->look_down(_tag => 'a', target => qr/./);

__DATA__
    <a target="_blank"          href="http://sharepoint/sites/cerner/quickreferenceguides/Documents/EXP001_Run_Printable_TCI_List.pdf" onmouseover="return overlib('This guide outlines the process for running a printable TCI List', CAPTION, 'TCI LIST');" onmouseout="return nd();">Run Printable TCI List (<i>Revised<i>)</a> 

Output

Run Printable TCI List (Revised)

Edit

To use this technique on the files in your example, the code would look like this

use strict;
use warnings;

use HTML::TreeBuilder;

my $tree = HTML::TreeBuilder->new_from_file('C:\HTMLsorter\sources.txt');

open my $out, '>', 'C:\HTMLsorter\outseark.txt' or die $!;

print $out $_->as_text, "\n" for $tree->look_down(_tag => 'a', target => qr/./);

Edit 2

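Now that I understand your requirement better, I can offer this alternative solution. It uses the HTML::DOM module to access the Document Object Model of the HTML document, because getting the desired result with HTML::TreeBuilder is relatively awkward.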

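I also noticed that your sample HTML contains <i>Revised<i>, which should clearly be <i>Revised</i>, and I have corrected it for one run of this test. In any case, the parser tries to handle faulty HTML the way a browser would, so the output is still usable even with the error.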

use strict;
use warnings;

use HTML::DOM;

my $dom = HTML::DOM->new;
$dom->parse_file('C:\HTMLsorter\sources.txt') or die $!;

open my $out, '>', 'C:\HTMLsorter\outseark.txt' or die $!;
print $out $_->innerHTML, "\n" for grep $_->attr('target'), $dom->getElementsByTagName('a');

Output

(with the tag corrected)

Run Printable TCI List (<i>Revised</i>)

(with the original tags)

Run Printable TCI List (<i>Revised<i>)</i></i>
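
If the source file cannot be corrected by hand, the mistyped <i>Revised<i> could be repaired before the text is parsed. This is only a rough sketch: repair_italics is a made-up helper, and it assumes that a second opening <i> following plain text is always a typo for </i>, which would be wrong for HTML that nests italics on purpose.

```perl
use strict;
use warnings;

# Hypothetical helper: when two opening <i> tags enclose plain text,
# treat the second one as a mistyped closing tag.
sub repair_italics {
    my ($html) = @_;
    $html =~ s{<i>([^<]*)<i>}{<i>$1</i>}gi;
    return $html;
}

print repair_italics('Run Printable TCI List (<i>Revised<i>)'), "\n";
# prints: Run Printable TCI List (<i>Revised</i>)
```

Text that is already well formed, such as <i>Revised</i>, does not match the pattern and is left alone.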
answered 2012-04-11T16:25:32.953