
I am currently trying to create a Perl web spider using WWW::Mechanize.

What I want to do is create a web spider that will crawl an entire site (the URL is entered by the user) and extract all the links from every page on that site.

The problem I have is how to crawl the whole site for every link without duplicates. This is what I have done so far (the part I'm having trouble with, anyway):

foreach (@nonduplicates) {   # array contains URLs like www.tree.com/contact-us, www.tree.com/varieties, ...
    $mech->get($_);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);  # find all links on this page that start with http://www.tree.com

    # THIS IS WHAT I WANT IT TO DO AFTER THE ABOVE (IN PSEUDOCODE), BUT I CAN'T GET IT WORKING:
    # foreach (@list) {
    #     if $_ is already in @nonduplicates, do nothing because that link has already been found
    #     else, append the link to the end of @nonduplicates so that it will be crawled for links as well
    # }
}

How can I do the above?

I am doing this to try to crawl an entire site and get a complete list of every URL on the site, without duplicates.

If you think this is not the best/easiest way to achieve the same result, I am open to ideas.

Your help is much appreciated, thanks.


2 Answers


Create a hash to keep track of which links you have seen before, and push any unseen ones onto @nonduplicates for processing:

# $mech and $urlToSpider are assumed to be set up as in the question.
$| = 1;             # Autoflush STDOUT so the progress line below updates immediately.
my $scanned = 0;

my @nonduplicates = ( $urlToSpider ); # Add the first link to the queue.
my %link_tracker = map { $_ => 1 } @nonduplicates; # Keep track of what links we've found already.

while (my $queued_link = pop @nonduplicates) {
    $mech->get($queued_link);
    my @list = $mech->find_all_links(url_abs_regex => qr/^\Q$urlToSpider\E/);

    for my $new_link (@list) {
        # Add the link to the queue unless we already encountered it.
        # Increment so we don't add it again.
        push @nonduplicates, $new_link->url_abs() unless $link_tracker{$new_link->url_abs()}++;
    }
    printf "\rPages scanned: [%d] Unique Links: [%s] Queued: [%s]", ++$scanned, scalar keys %link_tracker, scalar @nonduplicates;
}

# Dump the full list of unique links found.
use Data::Dumper;
print Dumper(\%link_tracker);
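
For completeness, here is a minimal sketch of the setup the snippet above assumes from the question: a WWW::Mechanize object in $mech and the starting URL in $urlToSpider. The autocheck => 0 option and the hard-coded URL are illustrative choices, not part of the original answer.

use strict;
use warnings;
use WWW::Mechanize;

# Starting URL entered by the user; hard-coded here only for illustration.
my $urlToSpider = 'http://www.tree.com';

# autocheck => 0 keeps Mechanize from dying on HTTP errors (404s, timeouts),
# so one broken link does not abort the whole crawl.
my $mech = WWW::Mechanize->new( autocheck => 0 );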
Answered 2012-10-31T21:21:04.327
use List::MoreUtils qw/uniq/;
...

my @list = $mech->find_all_links(...);

my @unique_urls = uniq( map { $_->url } @list );

Now @unique_urls contains the unique URLs from @list.
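
Note that uniq only removes duplicates within the @list collected from a single page; to avoid revisiting URLs across pages you would still combine it with a hash of already-seen URLs, as in the other answer. A rough sketch (the %seen and @queue names are illustrative, not from the answer):

use List::MoreUtils qw/uniq/;

my %seen;    # URLs already queued or crawled
my @unique_urls = uniq( map { $_->url } @list );

# Keep only URLs not encountered on any earlier page, and mark them as seen.
my @new_urls = grep { !$seen{$_}++ } @unique_urls;
push @queue, @new_urls;    # @queue is the crawl queue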

Answered 2012-10-31T21:11:52.470