
Good evening, dear community!

I want to process multiple web pages, somewhat like a web spider/crawler. I have something working - but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

This page yields more than 6,000 results! So how do I get all of them? I am using the module LWP::Simple, and I need some improved parameters that let me fetch all 6,150 records.

Attempt: here are the URLs of the first 5 pages:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200

As we can see, the "s" parameter in the URL starts at 0 on page 1 and increases by 50 per page. We can use this information to create a loop:

my $i_first    = 0;
my $i_last     = 6100;
my $i_interval = 50;

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
     my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
     #process pageurl 
}
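As a quick sanity check on the loop above (a standalone sketch; the URL pattern is the one from the question), the offsets s = 0, 50, ..., 6100 give 123 pages of 50 records each, i.e. exactly the 6,150 records:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $i_first    = 0;
my $i_last     = 6100;
my $i_interval = 50;

# Collect one URL per result page instead of fetching right away.
my @pageurls;
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    push @pageurls, "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
}

print scalar(@pageurls), "\n";   # 123 pages * 50 records = 6150 records
print $pageurls[-1], "\n";       # last page starts at offset s=6100
```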

tadmc (a very helpful user) created a great script that outputs the results in CSV format. I have built this loop into the code. (Note - I got something wrong! See my thoughts below, with code snippets and the error messages:)

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $i_first = "0"; 
my $i_last = "6100"; 
my $i_interval = "50"; 

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
     my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
     #process pageurl 
}

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

trim leading/trailing whitespace from base fields
        s/^s+//, s/\s+$// for @$row;

load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /n+/, $h{name};
        @h{qw/phone fax/} = split /n+/, $h{phone};

trim leading/trailing whitespace from derived fields
        s/^s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
} 

There are some problems - I made a mistake, and I guess the error is here:

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
 my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
      #process pageurl 
    }

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

I have written a kind of duplicate code. I need to omit one part... this one here:

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

Viewing the results on the command line:

martin@suse-linux:~> cd perl
martin@suse-linux:~/perl> perl bavaria_all_.pl
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
syntax error at bavaria_all_.pl line 59, near "/,"
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Substitution replacement not terminated at bavaria_all_.pl line 63.
martin@suse-linux:~/perl> 

What do you think!? Looking forward to your reply.

By the way - look at the code created by tadmc, without any improved spider logic... it runs very, very nicely - without any problems: it spits out nicely formatted CSV output!

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;

        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/} = split /\n+/, $h{phone};

        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
} 

Note: the code above runs well - it produces CSV-formatted output.


2 Answers


Excellent! I was waiting for you to figure out how to fetch multiple pages yourself!

1) Put my code inside the page-fetching loop (move the closing "}" all the way down to the end).

2) $html = get $pageurl; # change this to use your new URL

3) Put my backslashes back where they were: tr/\r//d;
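Putting the three steps together, the corrected script might look like this (a sketch only: it assumes the host is reachable and that the 6,150 records sit at offsets 0 through 6100; note the restored backslashes in tr/\r//d, s/^\s+//, and split /\n+/, and the restored # in front of the comment lines):

```perl
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my @cols   = qw(rownum number name phone type website);
my @fields = qw(rownum number name street postal town phone fax type website);

my $csv = Text::CSV->new({ binary => 1 });

my $i_first    = 0;
my $i_last     = 6100;
my $i_interval = 50;

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";

    my $html = get $pageurl;      # step 2: fetch the current page, not a fixed URL
    next unless defined $html;    # skip pages that fail to download

    $html =~ tr/\r//d;            # step 3: backslash restored - strip carriage returns
    $html =~ s/&nbsp;/ /g;        # expand the non-breaking spaces

    my $te = HTML::TableExtract->new();
    $te->parse($html);

    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {

            # trim leading/trailing whitespace from base fields
            s/^\s+//, s/\s+$// for @$row;

            # load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;

            # derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /\n+/, $h{name};
            @h{qw/phone fax/}               = split /\n+/, $h{phone};

            # trim leading/trailing whitespace from derived fields
            s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    }
}   # step 1: the closing "}" of the loop moved down to the very end
```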

Answered on 2011-02-26T00:04:29.777

Another way to implement pagination is to extract all URLs from the page and detect the pager URLs among them.

... 
for (@urls) { 
    if (is_pager_url($_) and not exists $seen{$_}) {
         push @pager_url, $_; 
         $seen{$_}++; 
    }
}
... 

sub is_pager_url { 
    my ($url) = @_; 
    return 1 if $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+};
    return 0;
}

This way you do not have to deal with incrementing a counter or determining the total number of pages. It also works for different values of a and s. By keeping a %seen hash, you cheaply avoid having to distinguish between previous-page and next-page links.
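For illustration, here is a self-contained sketch of that dedup logic. The input URL list is an assumption made up for the example (one duplicate pager link and one non-pager link); the filter itself follows the snippet above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: URLs as they might be extracted from one result page.
my @urls = (
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50',
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100',
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50',   # duplicate
    'http://192.68.214.70/km/asps/impressum.asp',                  # not a pager link
);

# Keep each pager URL once; %seen does the deduplication.
my (@pager_url, %seen);
for (@urls) {
    if (is_pager_url($_) and not exists $seen{$_}) {
        push @pager_url, $_;
        $seen{$_}++;
    }
}

print scalar(@pager_url), "\n";   # -> 2

sub is_pager_url {
    my ($url) = @_;
    return 1 if $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+};
    return 0;
}
```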

Answered on 2011-02-26T00:13:18.860