The general idea


This is the snippet I'm working with:

my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;

foreach (@blarg_links) {
        $url_temp = $_;
        $page_temp = get( $url_temp ) or die $!;
        $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
        @temp_stuff = $p_temp->look_down(
                _tag => 'foo',
                class => 'bar'
        );
        foreach (@temp_stuff) {
                push(@collector, "http://www.foobar.sx" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
        };
};

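Hopefully it's clear that what I'm desperately trying to do is push the link endings found on each page in a list of links into an array called @temp_stuff. So the first link in @blarg_links, when visited, has one or more foo tags with an associated bar class. When as_HTML is applied to such a tag, the output should match what I want from the href="..." attribute, and that match is then pumped into an array of links that contain the data I'm really after... does that make sense?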


Actual data


my $url2 = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $page2 = get( $url2 ) or die $!;
my $p2 = HTML::TreeBuilder->new_from_content( $page2 );

my @stuff2 = $p2->look_down(
        _tag => 'div',
        class => 'year mini-day-on'
);

my @chem_links;

foreach (@stuff2) {
        push(@chem_links, $1) if $_->as_HTML =~ m/(http:\/\/www\.chemistry\.ucla\.edu\/calendar-node-field-date\/day\/[0-9]{4}-[0-9]{2}-[0-9]{2})/;
};

my $url_temp;
my $page_temp;
my $p_temp;
my @temp_stuff;
my @collector;

foreach (@chem_links) {
        $url_temp = $_;
        $page_temp = get( $url_temp ) or die $!;
        $p_temp = HTML::TreeBuilder->new_from_content( $page_temp );
        @temp_stuff = $p_temp->look_down(
                _tag => 'span',
                class => 'field-content'
        );
};

foreach (@temp_stuff) {
                push(@collector, "http://www.chemistry.ucla.edu" . $1) if $_->as_HTML =~ m/href="(.*?)"/;
};

N.B. I want to use HTML::TreeBuilder. I'm aware of the alternatives.


2 Answers

Here's a rough attempt at what I think you want.

It fetches all of the links on the first page and visits each one in turn, printing the link found in each <span class="field-content"> element.
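One thing worth noting about the question's "actual data" snippet: its final foreach over @temp_stuff sits after the page loop has closed, so only the matches from the last page visited ever reach @collector. The following is a minimal sketch of that difference using core Perl only. The %links_on_page hash and its contents are made up for illustration and stand in for the fetched pages.

```perl
use strict;
use warnings;

# Hypothetical stand-in for fetched pages: page name => links found on it.
my %links_on_page = (
    'day-1' => ['/seminars/a'],
    'day-2' => [ '/seminars/b', '/seminars/c' ],
);

# Collecting after the loop: @found is overwritten on every pass,
# so only the last page's matches survive.
my ( @found, @outside );
for my $page ( sort keys %links_on_page ) {
    @found = @{ $links_on_page{$page} };
}
push @outside, @found;

# Collecting inside the loop: every page contributes its matches.
my @inside;
for my $page ( sort keys %links_on_page ) {
    my @found_here = @{ $links_on_page{$page} };
    push @inside, @found_here;
}

print scalar(@outside), " vs ", scalar(@inside), "\n";    # prints "2 vs 3"
```

Declaring the per-page variables with my inside the loop body, as the code below does, makes this kind of slip harder to write.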

use strict;
use warnings;
use 5.010;

use HTML::TreeBuilder;

use IO::Handle;   # provides STDOUT->autoflush on perls older than 5.14

STDOUT->autoflush;

my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';
my $tree = HTML::TreeBuilder->new_from_url($url);

my @chem_links;

for my $div ( $tree->look_down( _tag => 'div', class => qr{\bmini-day-on\b} ) ) {
  my ($anchor) = $div->look_down(_tag => 'a', href => qr{http://www\.chemistry\.ucla\.edu});
  next unless $anchor;    # skip day cells without a matching link
  push @chem_links, $anchor->attr('href');
}

my @collector;

for my $url (@chem_links) {

  say $url;

  my $tree = HTML::TreeBuilder->new_from_url($url);

  my @seminars;

  for my $span ( $tree->look_down( _tag => 'span', class => 'field-content' ) ) {
    my ($anchor) = $span->look_down(_tag => 'a', href => qr{/});
    next unless $anchor;    # some spans may hold plain text rather than a link
    push @seminars, 'http://www.chemistry.ucla.edu'.$anchor->attr('href');
  }

  say "  $_" for @seminars;
  say '';

  push @collector, @seminars;
};
answered 2014-06-04T20:28:16.653

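For a more modern framework for parsing web pages, I'd suggest taking a look at Mojo::UserAgent and Mojo::DOM. Instead of manually walking each part of the HTML tree, you can use the power of CSS selectors to zero in on the specific data you want. There's a nice 8-minute introductory video on the framework, Mojocast Episode 5.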
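To see the selector idea in isolation before hitting the live site, here is a minimal offline sketch. The HTML fragment is invented for illustration and is not the site's real markup, and the sketch uses the map( attr => 'href' ) form of Mojo::Collection, which, as far as I know, is what current Mojolicious releases expect.

```perl
use strict;
use warnings;

use Mojo::DOM;
use URI;

# Made-up fragment loosely shaped like a day page; not real site markup.
my $html = <<'HTML';
<div>
  <span class="field-content"><a href="/seminars/example-one">One</a></span>
  <span class="field-content">No link here</span>
  <span class="other"><a href="/ignored">Ignored</a></span>
</div>
HTML

my $base = 'http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-06';
my $dom  = Mojo::DOM->new($html);

# Select only anchors that are direct children of span.field-content,
# pull out each href, and resolve it against the page URL.
my @links = map { URI->new_abs( $_, $base )->as_string }
    $dom->find('span.field-content > a[href]')->map( attr => 'href' )->each;

print "$_\n" for @links;
```

The selector does the filtering that would otherwise need a nested look_down plus a check for a missing anchor: spans without a link, and links outside the wanted spans, simply never match.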

# Parses the UCLA Chemistry Calendar and displays all seminar links

use strict;
use warnings;

use Mojo::UserAgent;
use URI;

my $url = 'http://www.chemistry.ucla.edu/calendar-node-field-date/year';

my $ua = Mojo::UserAgent->new;
my $dom = $ua->get($url)->res->dom;

# map(attr => 'href') calls attr('href') on every matched element; newer
# Mojolicious releases no longer forward ->attr called directly on a collection
for my $dayhref ($dom->find('div.mini-day-on > a[href*="/day/"]')->map(attr => 'href')->each) {
    my $dayurl = URI->new($dayhref)->abs($url);
    print $dayurl, "\n";

    my $daydom = $ua->get($dayurl->as_string)->res->dom;
    for my $seminarhref ($daydom->find('span.field-content > a[href]')->map(attr => 'href')->each) {
        my $seminarurl = URI->new($seminarhref)->abs($dayurl);
        print "  $seminarurl\n";
    }

    print "\n";
}

The output is the same as that of Borodin's solution using HTML::TreeBuilder:

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-06
  http://www.chemistry.ucla.edu/seminars/nano-rheology-enzymes

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-09
  http://www.chemistry.ucla.edu/seminars/imaging-approach-biology-disease-through-chemistry

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-10
  http://www.chemistry.ucla.edu/seminars/arginine-methylation-%E2%80%93-substrates-binders-function
  http://www.chemistry.ucla.edu/seminars/special-inorganic-chemistry-seminar

http://www.chemistry.ucla.edu/calendar-node-field-date/day/2014-01-13
  http://www.chemistry.ucla.edu/events/robert-l-scott-lecture-0

...
answered 2014-06-06T00:56:52.773