I am building a basic search engine using the vector-space model, and this is the crawler that retrieves 500 URLs and removes the SGML tags from their content. However, it is very slow (it takes more than 30 minutes just to retrieve the URLs). How can I optimize the code? I have used wikipedia.org as an example starting URL.

use warnings;

use LWP::Simple;
use LWP::UserAgent;
use HTTP::Request;
use HTTP::Response;
use HTML::LinkExtor;

my $starting_url = 'http://en.wikipedia.org/wiki/Main_Page';
my @urls = $starting_url;
my %alreadyvisited;
my $browser = LWP::UserAgent->new();
$browser->timeout(5);
my $url_count = 0;

while (@urls) 
{ 
     my $url = shift @urls;
     next if $alreadyvisited{$url}; ## check if already visited

     my $request = HTTP::Request->new(GET => $url);
     my $response = $browser->request($request);

     if ($response->is_error())
     {
         print $response->status_line, "\n"; ## check for bad URL
     }
     my $contents = $response->content(); ## get contents from URL
     push @c, $contents;
     my @text = &RemoveSGMLtags(\@c);
     #print "@text\n";

     $alreadyvisited{$url} = 1; ## store URL in hash for future reference
     $url_count++;
     print "$url\n";

     if ($url_count == 500) ## exit if number of crawled pages exceed limit
     {
         exit 0; 
     } 


     my ($page_parser) = HTML::LinkExtor->new(undef, $url); 
     $page_parser->parse($contents)->eof; ## parse page contents
     my @links = $page_parser->links; 

     foreach my $link (@links) 
     {
             $test = $$link[2];
             $test =~ s!^https?://(?:www\.)?!!i;
             $test =~ s!/.*!!;
             $test =~ s/[\?\#\:].*//;
             if ($test eq "en.wikipedia.org")  ## check if URL belongs to unt domain
             {
                 next if ($$link[2] =~ m/^mailto/); 
                 next if ($$link[2] =~ m/s?html?|xml|asp|pl|css|jpg|gif|pdf|png|jpeg/);
                 push @urls, $$link[2];
             }
     }
     sleep 1;
}


sub RemoveSGMLtags 
{
    my ($input) = @_;
    my @INPUTFILEcontent = @$input;
    my $j;my @raw_text;
    for ($j=0; $j<$#INPUTFILEcontent; $j++)
    {
        my $INPUTFILEvalue = $INPUTFILEcontent[$j];
        use HTML::Parse;
        use HTML::FormatText;
        my $plain_text = HTML::FormatText->new->format(parse_html($INPUTFILEvalue));
        push @raw_text, ($plain_text);
    }
    return @raw_text;
}

2 Answers

  • Always use strict

  • Never use an ampersand (&) on subroutine calls

  • Use URI to manipulate URLs (see the sketch after this list)
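
For instance, here is a small sketch of how the domain check in the crawl loop might look using URI instead of hand-rolled regexes; the link value below is a made-up example, not taken from the question.

use strict;
use warnings;
use URI;

# A made-up absolute link of the kind HTML::LinkExtor hands back.
my $link = 'http://en.wikipedia.org/wiki/Perl?action=history#top';

my $uri = URI->new($link);

# mailto: and other non-HTTP schemes have no host, so check first.
if ( $uri->can('host') and $uri->host eq 'en.wikipedia.org' ) {
    print "same domain, keep it: ", $uri->canonical, "\n";
}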

You have a sleep 1 in there, which I assume is to avoid hammering the site too much, and that is good. But the bottleneck of almost any web-based application is the internet itself, and you won't be able to make your program any faster without requesting more from the site. That means removing the sleep and perhaps making parallel requests to the server using, for instance, LWP::Parallel::RobotUA. Is that a route you should be taking?
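
As a rough illustration only, here is a minimal sketch of fetching a batch of URLs in parallel using the register/wait interface documented for LWP::Parallel::UserAgent (LWP::Parallel::RobotUA is a robots.txt-aware subclass with the same basic usage); the batch of Wikipedia URLs and the timeout value are placeholders.

use strict;
use warnings;
use LWP::Parallel::UserAgent;
use HTTP::Request;

# Placeholder batch of URLs pulled off the crawl queue.
my @batch = (
    'http://en.wikipedia.org/wiki/Perl',
    'http://en.wikipedia.org/wiki/World_Wide_Web',
);

my $pua = LWP::Parallel::UserAgent->new();
$pua->timeout(5);      # per-connection timeout in seconds
$pua->redirect(1);     # follow redirects

# Queue every request; register() returns an error response on failure.
foreach my $url (@batch) {
    my $req = HTTP::Request->new( GET => $url );
    if ( my $err = $pua->register($req) ) {
        print STDERR $err->error_as_HTML;
    }
}

# Block until all queued requests have completed (or timed out).
my $entries = $pua->wait();

foreach my $key ( keys %$entries ) {
    my $response = $entries->{$key}->response;
    print $response->request->uri, " => ", $response->status_line, "\n";
    # feed $response->content to the existing link extraction here
}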

Answered 2013-04-07T16:37:56.310

Use WWW::Mechanize, which handles all of the URL parsing and extraction for you. It is much easier than all of the link parsing you are doing, it was created for exactly the kind of thing you are doing, and it is a subclass of LWP::UserAgent, so you should be able to change every LWP::UserAgent to WWW::Mechanize without changing any other code, except for the link extraction, where you can do this:

my $mech = WWW::Mechanize->new();
$mech->get( 'someurl.com' );
my @links = $mech->links;

Then @links is an array of WWW::Mechanize::Link objects.
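
As a follow-up sketch (assuming the url_abs method that WWW::Mechanize::Link provides and the en.wikipedia.org domain from the question), the same-domain filtering could then look something like this:

# Keep only same-domain HTTP(S) links from the page just fetched.
foreach my $link ( $mech->links ) {
    my $uri = $link->url_abs;                  # absolute URI object
    next unless $uri->scheme =~ /\Ahttps?\z/;
    next unless $uri->host eq 'en.wikipedia.org';
    print $uri->canonical, "\n";
}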

Answered 2013-04-07T16:54:51.133