perl - How can I use Perl's LWP::UserAgent to fetch the same URL with different query strings?

Question

I have a working LWP::UserAgent script that should be applied to the following URL:

http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html?show_school=5503

The same script has to run against many similar targets, whose URLs differ only in their endings:

html?show_school=5503
html?show_school=9002
html?show_school=5512

I would like to do this with LWP::UserAgent:

for my $i (0..10000) {
  $ua->get('[here the URL should be applied]', id => 21, extern_uid => $i);
  # process reply
}

In any case, a loop like this is one way to do the job. I assume LWP's API is not meant to replace core Perl features, so I can simply use a Perl loop to query multiple URLs.
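One caveat about the loop above: as far as I know, extra key/value arguments passed to $ua->get are sent as request header fields, not as query parameters, so the varying part has to go into the URL itself. A minimal sketch of building such URLs with the URI module's query_form method follows. The base URL and the three IDs are taken from the question; the helper name school_url is only illustrative.

```perl
use strict;
use warnings;
use URI;

# Base URL and example IDs taken from the question.
my $base       = 'http://dms-schule.bildung.hessen.de/suchen/suche_schul_db.html';
my @school_ids = (5503, 9002, 5512);

# Hypothetical helper: returns the base URL with ?show_school=<id> appended.
sub school_url {
    my ($id) = @_;
    my $uri = URI->new($base);
    $uri->query_form(show_school => $id);   # builds and escapes the query string
    return $uri->as_string;
}

# Print one URL per ID; each could be passed to $ua->get($url) instead.
print school_url($_), "\n" for @school_ids;
```

Each resulting string can be handed to $ua->get as its only argument inside the loop.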

Here is the code, which does not run yet because the loop still has to be worked in:

#use strict;

use DBI;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder::XPath;

# first get a list of all schools
my ($url = '[here the url should be applied] =',id);

for my $id (0..10000) {
  $ua->get(' [here the url should be applied ] ', id => 21, extern_uid => $i);
  # process reply
}  

#my $request = POST $url,
#                 [
#         Schulsuche=> "Ergebnisse anzeigen",
#         order => "schule_ort",
#         schulname => undef, 
#         schulort => undef, 
#         typid => "11",
#         verbinder => "AND"
#                 ];

my $ua = LWP::UserAgent->new;
print "getting all schools - this could take some time\n";
my $response = $ua->request($request);

# extract the ids
my @ids = $response->content =~ /getSchoolDetail\((\d+)/gs;
print "found " . scalar @ids . " schools\n";

# for this demo we only do the first 5
my @ids_to_do = @ids[0..4];

# use your own user and password
my $dbh = DBI->connect("DBI:mysql:database=schulen", "user", "pass", { AutoCommit => 0 }) or die $!;

my $sth = $dbh->prepare(<<sqlend);
   insert into schulen ( name , plz , ort, strasse , tel, fax , mail, quelle , original_id )
               values  ( ?, ?, ?, ?, ?, ?, ?, ?, ? )
sqlend

# now loop over ids
for my $id (@ids_to_do) {

  # get detail information for id
  my $res = $ua->get("[url]=> &gid=$id");

  # parse the response
  my $tree = HTML::TreeBuilder::XPath->new;
  $tree->parse($res->content);

  my $xpath = q|//div[@id='MCinhview']//div[@class='contentitem']//table|;
  my ($adress_table, $tel_table) = $tree->findnodes($xpath);

  my ($adr) = $adress_table->find("td");
  my ($name, $city, $street) = map { s/^\s*//; s/\s*$//; $_ } ($adr->content_list)[2,4,6];

  my($plz, $ort) = $city =~ /^(\d+)\s*(.*)/;
  my ($tel, $fax, $mail) = map { s/^\s*//; s/\s*$//; $_ } map { ($_->content_list)[1] } $tel_table->find("td");

  $sth->execute($name, $plz, $ort, $street, $tel, $fax, $mail, "SA", $id);
  $dbh->commit;

  $tree->delete;

  print "$name done\n";
}

Update, Sunday, October 25: I have applied OmnipotentEntity's suggestion.

#!/usr/bin/perl -W

use strict;
use warnings;         # give out some warnings if something does not run well
use diagnostics;      # tell me when something is wrong 
use DBI;
use LWP::UserAgent;
use HTTP::Request::Common;
use HTML::TreeBuilder::XPath;

# first get a list of all schools

my $ua = LWP::UserAgent->new;

$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7"); 

#pretending to be firefox on linux.

for my $i (0..10000) {
  my $request = HTTP::Request->new(GET => sprintf(" here to put the URL into =%d", $i));
  $request->header('Accept' => 'text/html');
  my $response = $ua->request($request);
  if ($response->is_success) {
    $pagecontent = $response -> content;
  }
# now we can do whatever with the $pagecontent

}
my $request = POST $url,
[
          order => "schule_ort",
          schulname => undef, 
          Basisdaten => undef,        
          Profil  => undef, 
          Schulort => undef, 
          typid => "11",
          Fax  => 
          Homepage  => undef, 
          verbinder => "AND"

];

print "getting all schools - this could take some time\n";
my $response = $ua->request($request);

# extract the ids
my @ids = $response->content =~ /getSchoolDetail\((\d+)/gs;
print "found " . scalar @ids . " schools\n";

# for this demo we only do the first 5
my @ids_to_do = @ids[0..4];

# use your own user and password
my $dbh = DBI->connect("DBI:mysql:database=schulen", "user", "pass", { AutoCommit => 0 }) or die $!;

my $sth = $dbh->prepare(<<sqlend);
   insert into schulen ( name , plz , ort, strasse , tel, fax , mail, quelle , original_id )
               values  ( ?, ?, ?, ?, ?, ?, ?, ?, ? )
sqlend

# now loop over ids
for my $id (@ids_to_do) {

  # get detail information for id
  my $res = $ua->get(" here to put the URL into => &gid=$id");

  # parse the response
  my $tree = HTML::TreeBuilder::XPath->new;
  $tree->parse($res->content);

  my $xpath = q|//div[@id='MCinhview']//div[@class='floatbox']//table|;
  my ($adress_table, $tel_table) = $tree->findnodes($xpath);

  my ($adr) = $adress_table->find("td");
  my ($name, $city, $street) = map { s/^\s*//; s/\s*$//; $_ } ($adr->content_list)[2,4,6];

  my($plz, $ort) = $city =~ /^(\d+)\s*(.*)/;
  my ($tel, $fax, $mail) = map { s/^\s*//; s/\s*$//; $_ } map { ($_->content_list)[1] } $tel_table->find("td");

  $sth->execute($name, $plz, $ort, $street, $tel, $fax, $mail, "SA", $id);
  $dbh->commit;

  $tree->delete;

  print "$name done\n";
}

I want to loop over the results, so I tried to plug in the corresponding URL, but I get a bunch of errors:

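suse-linux:/usr/perl # perl perl_mecha_example_two.pl
Global symbol "$pagecontent" requires explicit package name at perl_mecha_example_two.pl line 24.
Global symbol "$url" requires explicit package name at perl_mecha_example_two.pl line 29.
Execution of perl_mecha_example_two.pl aborted due to compilation errors (#1)
    (F) You've said "use strict" or "use strict vars", which indicates
    that all variables must either be lexically scoped (using "my" or "state"),
    declared beforehand using "our", or explicitly qualified to say
    which package the global variable is in (using "::").

Uncaught exception from user code:
Global symbol "$pagecontent" requires explicit package name at perl_mecha_example_two.pl line 24.
Global symbol "$url" requires explicit package name at perl_mecha_example_two.pl line 29.
Execution of perl_mecha_example_two.pl aborted due to compilation errors.
 at perl_mecha_example_two.pl line 86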

Now for the debugging part. What do I have to change? How do I apply the URL correctly?

With use strict, I am not allowed to use a variable before declaring it. The usual fix is to add my at its first occurrence, e.g. my $url and my $pagecontent.
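A minimal sketch of that fix, runnable offline: fake_fetch is a hypothetical stand-in for $ua->get(...)->content, and the URL is a placeholder. Note that declaring $url with my only silences the compile error; the variable still needs a real value before it is used in a request.

```perl
use strict;
use warnings;

# Hypothetical stand-in for $ua->get($url)->content, so the sketch runs offline.
sub fake_fetch {
    my ($url) = @_;
    return "<html>content of $url</html>";
}

# Declared with "my" and given a value before its first use.
my $url = 'http://path/to/url';

my @pages;
for my $i (1 .. 3) {
    # Declared inside the loop, so each iteration gets a fresh variable.
    my $pagecontent = fake_fetch("$url?show_school=$i");
    push @pages, $pagecontent;
}

print scalar(@pages), " pages collected\n";
```

Keeping $pagecontent scoped to the loop body also avoids carrying the previous page's content into an iteration whose request failed.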

score 4 · Accepted Answer

It's simple:

#!/usr/bin/perl -W

use strict;
use warnings;
use LWP::UserAgent;

my $ua = LWP::UserAgent->new;
# Pretend to be Firefox on Linux.
$ua->agent("Mozilla/5.0 (X11; U; Linux i686; en-US; rv:1.8.1.7) Gecko/20070914 Firefox/2.0.0.7");

for my $i (0..10000) {
  my $req = HTTP::Request->new(GET => sprintf("http://path/to/url?=%d", $i));
  $req->header('Accept' => 'text/html');
  my $res = $ua->request($req);
  my $pagecontent;   # declared with "my" so the script compiles under strict
  if ($res->is_success) {
    $pagecontent = $res->content;
  }
  # Do whatever with $pagecontent (it is undef if the request failed)
}

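This assumes you want to fetch every page in the range 0..10000. If you only want specific numbers, put those numbers into an array and iterate over that array instead of 0..10000.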
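A rough sketch of the array variant, under a few assumptions: the IDs are passed on the command line, the URL pattern is the same placeholder used in the answer, and the script name fetch_schools.pl is made up for illustration. It uses only $ua->get, is_success, decoded_content and status_line from LWP.

```perl
use strict;
use warnings;
use LWP::UserAgent;

# IDs come from the command line, e.g.: perl fetch_schools.pl 5503 9002 5512
# (fetch_schools.pl is a hypothetical script name). Non-numeric arguments are ignored.
my @ids = grep { /^\d+$/ } @ARGV;

my $ua = LWP::UserAgent->new(timeout => 10);

# Placeholder URL pattern from the answer; substitute the real one.
my $pattern = 'http://path/to/url?=%d';

my %content_for;
for my $id (@ids) {
    my $res = $ua->get(sprintf($pattern, $id));
    if ($res->is_success) {
        $content_for{$id} = $res->decoded_content;
    }
    else {
        # Report the failure and carry on with the next ID.
        warn "id $id failed: ", $res->status_line, "\n";
    }
}

printf "%d of %d pages fetched\n", scalar(keys %content_for), scalar(@ids);
```

Storing the pages in a hash keyed by ID keeps each result tied to the number that produced it, which is handy when some requests fail.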
