
Good evening, dear community!

I want to process a number of web pages, somewhat like a web spider/crawler. I have something working, but now I need some improved spider logic. See the target URL http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

Update:

Thanks to two great comments I have gotten a lot further. The code now runs very well. One last question: how do I store the data in a file, i.e. how do I force the parser to write its results to a file? That would be much more convenient than collecting more than 6000 records on the command line. And once the output goes into a file, I still need to do some final cleanup. Look at the output below: if we compare it with the target URL, it clearly needs some cleaning up, don't you think? See the target URL again: http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50

6114,7754,"Volksschule Zeil a.Mai",/Sa,"d a.Mai",(Gru,"09524/94992 09524/94997",,Volksschulen,
6115,7757,"Mittelschule Zeil - Sa","d a.Mai",Schulri,"g 
97475       Zeil","09524/94995
09524/94997",,Volksschulen,"      www.hauptschule-zeil-sand.de"
6116,3890,"Volksschule Zeilar",(Gru,"dschule)
Bgm.-Stallbauer-Str. 8
84367       Zeilar",,"08572/439
08572/920001",,Volksschulen,"      www.gs-zeilarn.de"
6117,4664,"Volksschule Zeitlar",(Gru,"dschule)
Schulstra�e 5
93197       Zeitlar",,"0941/63528
0941/68945",,Volksschulen,"      www.vs-zeitlarn.de"
6118,4818,"Mittelschule Zeitlar","Schulstra�e 5
93197       Zeitlar",,,"0941/63528
0941/68945",,Volksschulen,"      www.vs-zeitlarn.de"
6119,7684,"Volksschule Zeitlofs (Gru","dschule)
Raiffeise","Str. 36
97799       Zeitlofs",,"09746/347
09746/347",,Volksschulen,"      grundschule-zeitlofs.de"

Thanks for all the information! zero

Here is the older question: as part of a one-shot function it seems to work fine. But as soon as I include the function as part of a loop, it returns nothing... what could be the problem?

To start at the beginning: look at the target http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50 - that page already has more than 6000 results! So how do I get all of them? I use the module LWP::Simple, and I need some improved parameters I can use to fetch all 6150 records... I have code from the very supportive member tadmic (see this forum), and it basically runs very well. But after adding a few lines it (currently) spits out some errors.

Attempt: here are the URLs for the first 5 pages:

http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=0 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=50 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=100 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=150 
http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=200

We can see that the "s" parameter in the URL starts at 0 for page 1 and then increases by 50 for each page. We can use this information to create a loop:

#!/usr/bin/perl  
use warnings;  
use strict;  
use LWP::Simple;  
use HTML::TableExtract;  
use Text::CSV;  

my @cols = qw(  
    rownum  
    number  
    name  
    phone  
    type  
    website  
);  

my @fields = qw(  
    rownum  
    number  
    name  
    street  
    postal  
    town  
    phone  
    fax  
    type  
    website  
);  

my $i_first = "0";   
my $i_last = "6100";   
my $i_interval = "50";   

for (my $i = $i_first; $i <= $i_last; $i += $i_interval) {   
    my $html = get("http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50&s=$i");   
    $html =~ tr/r//d;     # strip the carriage returns  
    $html =~ s/&nbsp;/ /g; # expand the spaces  

    my $te = new HTML::TableExtract();  
    $te->parse($html);  

    my $csv = Text::CSV->new({ binary => 1 });  

    foreach my $ts ($te->table_states) {  
        foreach my $row ($ts->rows) {  
            #trim leading/trailing whitespace from base fields  
            s/^s+//, s/\s+$// for @$row;  

            #load the fields into the hash using a "hash slice"  
            my %h;  
            @h{@cols} = @$row;  

            #derive some fields from base fields, again using a hash slice  
            @h{qw/name street postal town/} = split /n+/, $h{name};  
            @h{qw/phone fax/} = split /n+/, $h{phone};  

            #trim leading/trailing whitespace from derived fields  
            s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

            $csv->combine(@h{@fields});  
            print $csv->string, "\n";  
        }  
    } 
}

I tested the code and got the results below.

By the way, here are lines 57 and 58 - the command line tells me the errors are there:

    #trim leading/trailing whitespace from derived fields  
        s/^s+//, s/\s+$// for @h{qw/name street postal town/};  

What do you think? Are some backslashes missing? How do I fix the code and test-run it so the results come out correctly?

Looking forward to your replies, zero

Here are the errors I get:

    Ot",,,Telefo,Fax,Schulat,Webseite                                                          Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.                                                                                        "lfd. N.",Schul-numme,Schul,"ame                                                                           
    Sta�e
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame
    Sta�e
    PLZ 
    Ot",,,Telefo,Fax,Schulat,Webseite
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
Use of uninitialized value $_ in substitution (s///) at bavaria_all_guru.pl line 58.
"lfd. N.",Schul-numme,Schul,"ame

3 Answers


These warnings appear whenever $_ is undef and a substitution involving it takes place - the s/// construct implicitly operates on $_. The solution is to check it with defined before attempting the substitution.

Apart from that, and although it is unrelated to the warnings, there is a logic error in your regexes:

s/^s+//, s/\s+$// for @h{qw/name street postal town/};

Note the missing \ in the first construct.

Eliminating the error and simplifying:

defined and s{^ \s+ | \s+ $}{}gx for @h{qw/name street postal town/};
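
A quick illustration with made-up values (hypothetical data, not taken from the target page): the undef entry is skipped without a warning, and the other fields are trimmed in place through the hash-slice aliasing.

my %h = ( name => '  Volksschule Test  ', street => undef, postal => ' 97475 ', town => 'Zeil' );
defined and s{^ \s+ | \s+ $}{}gx for @h{qw/name street postal town/};
# afterwards: name => 'Volksschule Test', street => undef, postal => '97475', town => 'Zeil'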

To output to a file, add the following before the for loop:

open my $fh, '>', '/path/to/output/file' or die $!;

Instead of:

print $csv->string, "\n";

use:

print $fh $csv->string, "\n";

This is the syntactic change from print LIST to print FILEHANDLE LIST.
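
Putting it together, here is a minimal self-contained sketch of the pattern; the filename schulen.csv and the single placeholder row are assumptions for illustration only.

#!/usr/bin/perl
use strict;
use warnings;
use Text::CSV;

my $csv = Text::CSV->new({ binary => 1 });

# open the output file once, before looping over the rows
open my $fh, '>', 'schulen.csv' or die "Cannot open schulen.csv: $!";

for my $row ( [ 6114, 7754, 'Volksschule Test', '09524/94992' ] ) {   # placeholder row
    $csv->combine(@$row);
    print $fh $csv->string, "\n";   # print FILEHANDLE LIST writes to the file, not STDOUT
}

close $fh or die "Cannot close schulen.csv: $!";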


Answered 2011-02-26T15:05:29.090

As you have it, this line does not remove the carriage returns:

$html =~ tr/r//d;     # strip the carriage returns  

You would need:

$html =~ tr/\r//d;     # strip the carriage returns

or possibly even:

$html =~ tr/\r\n//d;     # strip the carriage returns  
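
A quick sketch of the difference on a made-up string: without the backslash, tr deletes every letter 'r' from the page text, which is exactly why words like "Schulart" and "Nr." come out mangled in the output above, while the \r form only strips carriage returns.

my $html = "Schulart Nr. 1\r\n";

(my $wrong = $html) =~ tr/r//d;    # deletes the letter 'r':    "Schulat N. 1\r\n"
(my $right = $html) =~ tr/\r//d;   # deletes carriage returns:  "Schulart Nr. 1\n"
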
Answered 2011-02-26T15:10:10.567

If you are trying to extract links from the pages, use WWW::Mechanize. It is a wrapper around LWP that parses the HTML properly to get at the links, along with countless other conveniences for people scraping web pages.
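
A minimal sketch of what that might look like for the first result page (assuming you simply want to list every link it contains):

#!/usr/bin/perl
use strict;
use warnings;
use WWW::Mechanize;

my $mech = WWW::Mechanize->new();
$mech->get('http://192.68.214.70/km/asps/schulsuche.asp?q=e&a=50');

# find_all_links() returns WWW::Mechanize::Link objects parsed from the HTML
for my $link ( $mech->find_all_links() ) {
    printf "%s\t%s\n", $link->url_abs, $link->text // '';
}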

Answered 2011-02-26T20:16:42.307