
I have the following HTML, which is part of a web page.

<h2 id="failed_process">Failed Process</h2>
<table border="1">
  <thead>
    <tr>
      <th>
        <b>pid</b>
      </th>
      <th>
        <b>Priority</b>
      </th>
    </tr>
  </thead>
  <tbody>
    <tr>
      <td id="90"><a href="details.jsp?pid=p_201211162334&refresh=0">p_201211162334</a></td>
      <td id="priority_90">NORMAL</td>
    </tr>
    <tr>
      <td id="91"><a href="details.jsp?pid=p_201211163423&refresh=0">p_201211163423</a></td>
      <td id="priority_91">NORMAL</td>
    </tr>
    <tr>
      <td id="98"><a href="details.jsp?pid=p_201211166543&refresh=0">p_201211166543</a></td>
      <td id="priority_98">NORMAL</td>
    </tr>
  </tbody>
</table>
<hr>

I need to extract the pid column. The output should look like this:

pid
p_201211162334
p_201211163423
p_201211166543

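The "Failed Process" table is table number 4 on the page. The problem is that if I hard-code table number 4 and the page has no failed tasks, the extraction falls through to the next table and picks up that table's pids, which gives the wrong result.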

I am using the code below to get the result.

#!/usr/bin/perl
use strict;
use warnings;
use lib qw(..);
use LWP::Simple;          # provides get()
use HTML::TableExtract;

my $content = get("URL");
my $te = HTML::TableExtract->new(
    headers => [qw(pid)],
    attribs => { id => 'failed_process' },
);

$te->parse($content);

foreach my $col ($te->rows) {
    print "\t", @$col, "\n";
}

But I get the following error:

Can't call method "rows" on an undefined value 

2 Answers


Using Mojo::DOM, my favorite DOM parser, from the Mojolicious suite, it looks like this:

#!/usr/bin/env perl

use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# instantiate with all DATA lines
my $dom = Mojo::DOM->new(do { local $/; <DATA> });

# extract all first column cells
$dom->find('table tr')->each(sub {
    my $cell = shift->children->[0];
    say $cell->all_text;
});

__DATA__
<h2 id="failed_process">Failed Process</h2>
<table border="1">
    ...

Output:

pid
p_201211162334
p_201211163423
p_201211166543
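
Note that the selector 'table tr' matches rows in every table on the page, not only the failed-process one. The following is a minimal sketch of one way to narrow it down, assuming the table always directly follows the <h2 id="failed_process"> heading, as it does in the question's snippet. The second table in the sample HTML and the helper name failed_pids are made up for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use feature 'say';
use Mojo::DOM;

# Hypothetical page: the failed-process table followed by an unrelated table.
my $html = <<'HTML';
<h2 id="failed_process">Failed Process</h2>
<table border="1">
  <thead><tr><th><b>pid</b></th><th><b>Priority</b></th></tr></thead>
  <tbody>
    <tr><td id="90"><a href="details.jsp?pid=p_201211162334&refresh=0">p_201211162334</a></td><td>NORMAL</td></tr>
    <tr><td id="91"><a href="details.jsp?pid=p_201211163423&refresh=0">p_201211163423</a></td><td>NORMAL</td></tr>
  </tbody>
</table>
<h2 id="other">Other</h2>
<table border="1">
  <thead><tr><th>pid</th></tr></thead>
  <tbody><tr><td><a href="#">p_other</a></td></tr></tbody>
</table>
HTML

# Returns the pid link texts from the table that immediately follows
# <h2 id="failed_process">, or an empty list if there is no such table.
sub failed_pids {
    my ($page) = @_;
    my $dom = Mojo::DOM->new($page);

    # "E + F" is the adjacent-sibling combinator: a table right after the h2.
    my $table = $dom->at('h2#failed_process + table') or return;

    # First cell of each body row, then the text of the link inside it.
    return $table->find('tbody tr td:first-child a')->map('text')->each;
}

say for failed_pids($html);
```

Because the lookup is anchored on the heading rather than on a table index, a page without that heading and table yields an empty list instead of the pids of whichever table comes next.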
answered 2013-01-07T13:10:06.610

After $te->parse($html) you can add a loop such as foreach my $table ($te->tables) ..., and then get each table's rows with $table->rows. You can also use Data::Dumper to inspect $te.
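
A rough sketch of that loop follows, as far as I understand the module. The error in the question most likely occurs because $te->rows is a shortcut for the first matched table, and no table matched: the id "failed_process" sits on the <h2>, not on the <table>, so the attribs filter excludes everything. Iterating over $te->tables simply runs zero times in that case. The helper name pid_tables and the trimmed sample HTML are assumptions for illustration.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use HTML::TableExtract;

# Trimmed sample based on the question's snippet.
my $html = <<'HTML';
<h2 id="failed_process">Failed Process</h2>
<table border="1">
  <thead><tr><th><b>pid</b></th><th><b>Priority</b></th></tr></thead>
  <tbody>
    <tr><td id="90"><a href="details.jsp?pid=p_201211162334&refresh=0">p_201211162334</a></td><td>NORMAL</td></tr>
    <tr><td id="91"><a href="details.jsp?pid=p_201211163423&refresh=0">p_201211163423</a></td><td>NORMAL</td></tr>
  </tbody>
</table>
HTML

# Returns one hashref per table that has a "pid" header column,
# holding the table's coordinates and its pid values.
sub pid_tables {
    my ($page) = @_;

    # Match on the header only; there is no attribs filter here because
    # the <table> element in the question has no id attribute.
    my $te = HTML::TableExtract->new( headers => [qw(pid)] );
    $te->parse($page);

    my @found;
    foreach my $table ( $te->tables ) {    # empty list if nothing matched
        my ( $depth, $count ) = $table->coords;
        my @pids;
        foreach my $row ( $table->rows ) {
            my $cell = $row->[0];
            next unless defined $cell;
            $cell =~ s/^\s+|\s+$//g;       # trim surrounding whitespace
            push @pids, $cell if length $cell;
        }
        push @found, { depth => $depth, count => $count, pids => \@pids };
    }
    return @found;
}

foreach my $t ( pid_tables($html) ) {
    print "table at depth $t->{depth}, count $t->{count}\n";
    print "\t$_\n" for @{ $t->{pids} };
}
```

Printing the coordinates shows which tables actually matched the pid header, which helps when deciding whether a header match alone is specific enough for the page.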

answered 2013-01-07T10:18:04.727