
Good evening, dear community!

I want to process multiple web pages, somewhat like a web spider/crawler. I have something working - but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

This page yields more than 6,000 results! So how do I get all of them? I am using the module LWP::Simple, and I need some improved parameters that let me fetch all 6,150 records.

Attempt: here are the URLs of the first 5 pages:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200

As we can see, the "s" parameter in the URL starts at 0 on page 1 and increases by 50 per page. We can use this information to create a loop:

my $i_first    = 0;
my $i_last     = 6100;
my $i_interval = 50;

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
     my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
     #process pageurl 
}
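As a quick sanity check on the loop above (a standalone sketch; the URL pattern is the one from the question), the offsets s = 0, 50, ..., 6100 give 123 pages of 50 records each, i.e. exactly the 6,150 records:

```perl
#!/usr/bin/perl
use strict;
use warnings;

my $i_first    = 0;
my $i_last     = 6100;
my $i_interval = 50;

# Collect one URL per result page instead of fetching right away.
my @pageurls;
for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    push @pageurls, "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";
}

print scalar(@pageurls), "\n";   # 123 pages * 50 records = 6150 records
print $pageurls[-1], "\n";       # last page starts at offset s=6100
```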

tadmc (a very helpful user) created a great script that outputs the results in CSV format. I have built this loop into the code. (Note - I got something wrong! See my thoughts below, with code snippets and the error messages:)

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $i_first = "0"; 
my $i_last = "6100"; 
my $i_interval = "50"; 

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
     my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
     #process pageurl 
}

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

trim leading/trailing whitespace from base fields
        s/^s+//, s/\s+$// for @$row;

load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /n+/, $h{name};
        @h{qw/phone fax/} = split /n+/, $h{phone};

trim leading/trailing whitespace from derived fields
        s/^s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
} 

There are some problems - I made a mistake, and I guess the error is here:

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) { 
 my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i"; 
      #process pageurl 
    }

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

I have written a kind of duplicate code. I need to omit one part... this one here:

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

Viewing the results on the command line:

martin@suse-linux:~> cd perl
martin@suse-linux:~/perl> perl bavaria_all_.pl
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Possible unintended interpolation of %h in string at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 52.
syntax error at bavaria_all_.pl line 59, near "/,"
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 59.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Global symbol "%h" requires explicit package name at bavaria_all_.pl line 60.
Substitution replacement not terminated at bavaria_all_.pl line 63.
martin@suse-linux:~/perl> 

What do you think!? Looking forward to your reply.

By the way - look at the code created by tadmc, without any improved spider logic... it runs very, very nicely - without any problems: it spits out nicely formatted CSV output!

#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my $html= get 'http://192.68.214.70/km/asps/schulsuche.asp?q=n&a=50';
$html =~ tr/\r//d;     # strip the carriage returns
$html =~ s/&nbsp;/ /g; # expand the spaces

my $te = new HTML::TableExtract();
$te->parse($html);

my @cols = qw(
    rownum
    number
    name
    phone
    type
    website
);

my @fields = qw(
    rownum
    number
    name
    street
    postal
    town
    phone
    fax
    type
    website
);

my $csv = Text::CSV->new({ binary => 1 });

foreach my $ts ($te->table_states) {
    foreach my $row ($ts->rows) {

        # trim leading/trailing whitespace from base fields
        s/^\s+//, s/\s+$// for @$row;

        # load the fields into the hash using a "hash slice"
        my %h;
        @h{@cols} = @$row;

        # derive some fields from base fields, again using a hash slice
        @h{qw/name street postal town/} = split /\n+/, $h{name};
        @h{qw/phone fax/} = split /\n+/, $h{phone};

        # trim leading/trailing whitespace from derived fields
        s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

        $csv->combine(@h{@fields});
        print $csv->string, "\n";
    }
} 

Note: the code above runs well - it produces CSV-formatted output.


2 Answers


Excellent! I was waiting for you to figure out how to fetch multiple pages yourself!

1) Put my code inside the page-fetching loop (move the closing "}" all the way down to the end).

2) $html = get $pageurl; # change this to use your new URL

3) Put my backslashes back where they were: tr/\r//d;
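Putting the three steps together, the corrected script might look like this (a sketch only: it assumes the host is reachable and that the 6,150 records sit at offsets 0 through 6100; note the restored backslashes in tr/\r//d, s/^\s+//, and split /\n+/, and the restored # in front of the comment lines):

```perl
#!/usr/bin/perl
use warnings;
use strict;
use LWP::Simple;
use HTML::TableExtract;
use Text::CSV;

my @cols   = qw(rownum number name phone type website);
my @fields = qw(rownum number name street postal town phone fax type website);

my $csv = Text::CSV->new({ binary => 1 });

my $i_first    = 0;
my $i_last     = 6100;
my $i_interval = 50;

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {
    my $pageurl = "http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i";

    my $html = get $pageurl;      # step 2: fetch the current page, not a fixed URL
    next unless defined $html;    # skip pages that fail to download

    $html =~ tr/\r//d;            # step 3: backslash restored - strip carriage returns
    $html =~ s/&nbsp;/ /g;        # expand the non-breaking spaces

    my $te = HTML::TableExtract->new();
    $te->parse($html);

    foreach my $ts ($te->table_states) {
        foreach my $row ($ts->rows) {

            # trim leading/trailing whitespace from base fields
            s/^\s+//, s/\s+$// for @$row;

            # load the fields into the hash using a "hash slice"
            my %h;
            @h{@cols} = @$row;

            # derive some fields from base fields, again using a hash slice
            @h{qw/name street postal town/} = split /\n+/, $h{name};
            @h{qw/phone fax/}               = split /\n+/, $h{phone};

            # trim leading/trailing whitespace from derived fields
            s/^\s+//, s/\s+$// for @h{qw/name street postal town/};

            $csv->combine(@h{@fields});
            print $csv->string, "\n";
        }
    }
}   # step 1: the closing "}" of the loop moved down to the very end
```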

Answered on 2011-02-26T00:04:29.777

Another way to implement pagination is to extract all URLs from the page and detect the pager URLs among them.

... 
for (@urls) { 
    if (is_pager_url($_) and not exists $seen{$_}) {
         push @pager_url, $_; 
         $seen{$_}++; 
    }
}
... 

sub is_pager_url { 
    my ($url) = @_; 
    return 1 if $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+};
    return 0;
}

This way you do not have to deal with incrementing a counter or determining the total number of pages. It also works for different values of a and s. By keeping a %seen hash, you cheaply avoid having to distinguish between previous-page and next-page links.
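For illustration, here is a self-contained sketch of that dedup logic. The input URL list is an assumption made up for the example (one duplicate pager link and one non-pager link); the filter itself follows the snippet above:

```perl
#!/usr/bin/perl
use strict;
use warnings;

# Hypothetical input: URLs as they might be extracted from one result page.
my @urls = (
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50',
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100',
    'http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50',   # duplicate
    'http://192.68.214.70/km/asps/impressum.asp',                  # not a pager link
);

# Keep each pager URL once; %seen does the deduplication.
my (@pager_url, %seen);
for (@urls) {
    if (is_pager_url($_) and not exists $seen{$_}) {
        push @pager_url, $_;
        $seen{$_}++;
    }
}

print scalar(@pager_url), "\n";   # -> 2

sub is_pager_url {
    my ($url) = @_;
    return 1 if $url =~ m{schulsuche\.asp\?q=e&a=\d+&s=\d+};
    return 0;
}
```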

Answered on 2011-02-26T00:13:18.860