
Good evening, dear community!

I want to process a number of web pages, somewhat like a web spider/crawler. I have something working, but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

Update:

Thanks to two great comments I have gotten a lot further. The code now runs very well. One last question: how do I store the data in a file, i.e. how do I force the parser to write its results to a file? That would be much more convenient than collecting more than 6000 records on the command line. And once the output goes into a file, I still need to do some final cleanup. Look at the output below: if we compare it with the target URL, it clearly needs some cleaning up, don't you think? See the target URL again: http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

6114,7754,"Volksschule Zeil a.Mai",/Sa,"d a.Mai",(Gru,"09524/94992 09524/94997",,Volksschulen,
6115,7757,"Mittelschule Zeil - Sa","d a.Mai",Schulri,"g 
97475       Zeil","09524/94995
09524/94997",,Volksschulen,"      www.hauptschule-zeil-sand.de"
6116,3890,"Volksschule Zeilar",(Gru,"dschule)
Bgm.-Stallbauer-Str. 8
84367       Zeilar",,"08572/439
08572/920001",,Volksschulen,"      www.gs-zeilarn.de"
6117,4664,"Volksschule Zeitlar",(Gru,"dschule)
Schulstra�e 5
93197       Zeitlar",,"0941/63528
0941/68945",,Volksschulen,"      www.vs-zeitlarn.de"
6118,4818,"Mittelschule Zeitlar","Schulstra�e 5
93197       Zeitlar",,,"0941/63528
0941/68945",,Volksschulen,"      www.vs-zeitlarn.de"
6119,7684,"Volksschule Zeitlofs (Gru","dschule)
Raiffeise","Str. 36
97799       Zeitlofs",,"09746/347
09746/347",,Volksschulen,"      grundschule-zeitlofs.de"

Thanks for all the information! zero

Here is the older question: as part of a one-shot function it seems to work fine. But as soon as I include the function as part of a loop, it returns nothing... what could be the problem?

To start at the beginning: look at the target http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50 - that page already has more than 6000 results! So how do I get all of them? I use the module LWP::Simple, and I need some improved parameters I can use to fetch all 6150 records... I have code from the very supportive member tadmic (see this forum), and it basically runs very well. But after adding a few lines it (currently) spits out some errors.

Attempt: here are the URLs for the first 5 pages:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200

We can see that the "s" parameter in the URL starts at 0 for page 1 and then increases by 50 for each page. We can use this information to create a loop:

#!/usr/bin/perl  
use warnings;  
use strict;  
use LWP::Simple;  
use HTML::TableExtract;  
use Text::CSV;  

my @cols = qw(  
    rownum  
    number  
    name  
    phone  
    type  
    website  
);  

my @fields = qw(  
    rownum  
    number  
    name  
    street  
    postal  
    town  
    phone  
    fax  
    type  
    website  
);  

my $i_first = "0";   
my $i_last = "6100";   
my $i_interval = "50";   

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {   
    my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");   
    $html =~ tr/r//d;     # strip the carriage returns  
    $html =~ s/&nbsp;/ /g; # expand the spaces  

    my $te = new HTML::TableExtract();  
    $te->parse($html);  

    my $csv = Text::CSV->new({ binary => 1 });  

    foreach my $ts ($te->table_states) {  
        foreach my $row ($ts->rows) {  
            #trim leading/trailing whitespace from base fields  
            s/^s+//, s/\s+$// for @$row;  

            #load the fields into the hash using a "hash slice"  
            my %h;  
            @h{@cols} = @$row;  

            #derive some fields from base fields, again using a hash slice  
            @h{qw/name street postal town/} = split /n+/, $h{name};  
            @h{qw/phone fax/} = split /n+/, $h{phone};  

            #trim leading/trailing whitespace from derived fields  
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

            $csv->combine(@h{@fields});  
            print $csv->string, "\n";  
        }  
    } 
}

I tested the code and got the results below.

By the way, here are lines 57 and 58 - the command line tells me the errors are there:

    #trim leading/trailing whitespace from derived fields  
        s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

What do you think? Are some backslashes missing? How do I fix the code and test-run it so the results come out correctly?

Looking forward to your replies, zero

Here are the errors I get:

    Ot",,,Telefo,Fax,Schulat,Webseite                                                          Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        "lfd. N.",Schul-numme,Schul,"ame                                                                           
    Sta�e
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Sta�e
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame

3 Answers


These warnings appear whenever $_ is undef and a substitution involving it takes place - the s/// construct implicitly operates on $_. The solution is to check it with defined before attempting the substitution.

Apart from that, and although it is unrelated to the warnings, there is a logic error in your regexes:

s/^s+//, s/\s+$// for @h{qw/name street postal town/};

Note the missing \ in the first construct.

Eliminating the error and simplifying:

defined and s{^ \s+ | \s+ $}{}gx for @h{qw/name street postal town/};
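
A quick illustration with made-up values (hypothetical data, not taken from the target page): the undef entry is skipped without a warning, and the other fields are trimmed in place through the hash-slice aliasing.

my %h = ( name => '  Volksschule Test  ', street => undef, postal => ' 97475 ', town => 'Zeil' );
defined and s{^ \s+ | \s+ $}{}gx for @h{qw/name street postal town/};
# afterwards: name => 'Volksschule Test', street => undef, postal => '97475', town => 'Zeil'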

To output to a file, add the following before the for loop:

open my $fh, '>', '/path/to/output/file' or die $!;

Instead of:

print $csv->string, "\n";

use:

print $fh $csv->string, "\n";

This is the syntactic change from print LIST to print FILEHANDLE LIST.
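
Putting it together, here is a minimal self-contained sketch of the pattern; the filename schulen.csv and the single placeholder row are assumptions for illustration only.

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });

# open the output file once, before looping over the rows
open my $fh, '>', 'schulen.csv' or die "Cannot open schulen.csv: $!";

for my $row ( [ 6114, 7754, 'Volksschule Test', '09524/94992' ] ) {   # placeholder row
    $csv->combine(@$row);
    print $fh $csv->string, "\n";   # print FILEHANDLE LIST writes to the file, not STDOUT
}

close $fh or die "Cannot close schulen.csv: $!";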


Answered 2011-02-26T15:05:29.090

As you have it, this line does not remove the carriage returns:

$html =~ tr/r//d;     # strip the carriage returns  

You would need:

$html =~ tr/\r//d;     # strip the carriage returns

or possibly even:

$html =~ tr/\r\n//d;     # strip the carriage returns  
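
A quick sketch of the difference on a made-up string: without the backslash, tr deletes every letter 'r' from the page text, which is exactly why words like "Schulart" and "Nr." come out mangled in the output above, while the \r form only strips carriage returns.

my $html = "Schulart Nr. 1\r\n";

(my $wrong = $html) =~ tr/r//d;    # deletes the letter 'r':    "Schulat N. 1\r\n"
(my $right = $html) =~ tr/\r//d;   # deletes carriage returns:  "Schulart Nr. 1\n"
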
Answered 2011-02-26T15:10:10.567

If you are trying to extract links from the pages, use WWW::Mechanize. It is a wrapper around LWP that parses the HTML properly to get at the links, along with countless other conveniences for people scraping web pages.
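
A minimal sketch of what that might look like for the first result page (assuming you simply want to list every link it contains):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50');

# find_all_links() returns WWW::Mechanize::Link objects parsed from the HTML
for my $link ( $mech->find_all_links() ) {
    printf "%s\t%s\n", $link->url_abs, $link->text // '';
}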

Answered 2011-02-26T20:16:42.307