
I am currently trying to create a Perl web spider using WWW::Mechanize.

What I want to do is create a web spider that will crawl an entire site (the URL is entered by the user) and extract all the links from every page on that site.

The problem I have is how to crawl the whole site for every link without duplicates. This is what I have done so far (the part I'm having trouble with, anyway):

foreach (@nonduplicates) {   # array contains URLs like www.tree.com/contact-us, www.tree.com/varieties, ...
    $mech->get($_);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);  # find all links on this page that start with http://www.tree.com

    # THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT I CAN'T GET IT WORKING:
    # foreach (@list) {
    #     if $_ is already in @nonduplicates, do nothing because that link has already been found
    #     else, append the link to the end of @nonduplicates so that it will be crawled for links as well
    # }
}

How can I do the above?

I am doing this to try to crawl an entire site and get a complete list of every URL on the site, without duplicates.

If you think this is not the best/easiest way to achieve the same result, I am open to ideas.

Your help is much appreciated, thanks.


2 Answers


Create a hash to keep track of which links you have seen before, and push any unseen ones onto @nonduplicates for processing:

# $mech and $urlToSpider are assumed to be set up as in the question.
$| = 1;             # Autoflush STDOUT so the progress line below updates immediately.
my $scanned = 0;

my @nonduplicates = ( $urlToSpider ); # Add the first link to the queue.
my %link_tracker = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.

while (my $queued_link = pop @nonduplicates) {
    $mech->get($queued_link);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);

    for my $new_link (@list) {
        # Add the link to the queue unless we already encountered it.
        # Increment so we don't add it again.
        push @nonduplicates, $new_link->url_abs() unless $link_tracker{$new_link->url_abs()}++;
    }
    printf "\rPages scanned: [%d] Unique Links: [%s] Queued: [%s]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
}

# Dump the full list of unique links found.
use Data::Dumper;
print Dumper(\%link_tracker);
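
For completeness, here is a minimal sketch of the setup the snippet above assumes from the question: a WWW::Mechanize object in $mech and the starting URL in $urlToSpider. The autocheck => 0 option and the hard-coded URL are illustrative choices, not part of the original answer.

use strict;
use warnings;
use WWW::Mechanize;

# Starting URL entered by the user; hard-coded here only for illustration.
my $urlToSpider = 'http://www.tree.com';

# autocheck => 0 keeps Mechanize from dying on HTTP errors (404s, timeouts),
# so one broken link does not abort the whole crawl.
my $mech = WWW::Mechanize->new( autocheck => 0 );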
Answered 2012-10-31T21:21:04.327
use List::MoreUtils qw/uniq/;
...

my @list = $mech->find_all_links(...);

my @unique_urls = uniq( map { $_->url } @list );

Now @unique_urls contains the unique URLs from @list.
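
Note that uniq only removes duplicates within the @list collected from a single page; to avoid revisiting URLs across pages you would still combine it with a hash of already-seen URLs, as in the other answer. A rough sketch (the %seen and @queue names are illustrative, not from the answer):

use List::MoreUtils qw/uniq/;

my %seen;    # URLs already queued or crawled
my @unique_urls = uniq( map { $_->url } @list );

# Keep only URLs not encountered on any earlier page, and mark them as seen.
my @new_urls = grep { !$seen{$_}++ } @unique_urls;
push @queue, @new_urls;    # @queue is the crawl queue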

Answered 2012-10-31T21:11:52.470