
I've been using HTML::SimpleLinkExtor to extract links from this page: http://cpc.cs.qub.ac.uk/authorIndex/AUTHOR_index.html. It works well for everything except links that contain the character 'Ç', which it changes to %C7. As a result, when I use such a link in the rest of my program I get a 404 error. Here's my code:

#!/usr/bin/perl

use strict;
use warnings;
use HTML::SimpleLinkExtor;
use Time::HiRes qw(sleep);
use Test::WWW::Selenium;
use Test::More "no_plan"; #tests => 37; #
#use Test::Exception;


Test::More->builder->output ('result.txt');
Test::More->builder->failure_output ('errors.txt');

my $base = "http://cpc.cs.qub.ac.uk/authorIndex/AUTHOR_index.html";

my $sel = Test::WWW::Selenium->new( host        => "localhost", 
                                    port        =>  4444, 
                                    browser     => "*firefox", 
                                    browser_url => "http://cpc.cs.qub.ac.uk/" );


################################################
my  $extor = HTML::SimpleLinkExtor->new($base);
    $extor->parse_url($base);           
my  @all_links   = $extor->a;           
################################################


$sel->start();

$sel->open_ok($base);

$sel->open_ok($_) foreach (@all_links);

$sel->stop();

Also, does anyone have ideas on how I can implement the click() function with the extracted links?

Thanks


1 Answer


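The web page is served in latin1 encoding, so it encodes Ç as the byte 0xC7. Still, HTML::SimpleLinkExtor should be smart enough to convert the link to UTF-8, since that is pretty much the standard. It doesn't, however. Its source says: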

sub parse_url {
    my $data = $_[0]->ua->get( $_[1] )->content;
    return unless $data;
    $_[0]->parse( $data );
}

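The bug here is that it should use ->decoded_content instead of ->content so that the encoding conversion is done properly. You may want to file a bug report for HTML::SimpleLinkExtor. In the meantime, you can try writing your own method to replace the broken one.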

Edit: This might work (untested):

# replace this:
$extor->parse_url($base);           

# with this:
my $data = $extor->ua->get($base)->decoded_content;
if (defined $data) {
    $extor->parse($data);
}
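
To see where %C7 comes from, here is a minimal sketch that percent-encodes Ç once as latin1 bytes and once as UTF-8 bytes. The percent_encode helper is hypothetical, written only for this illustration, and is not part of HTML::SimpleLinkExtor. Whether the server actually wants the UTF-8 form is an assumption based on the reasoning above.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper (not from any module): percent-encode every byte
# that is not an unreserved URI character, a slash, or a colon.
sub percent_encode {
    my ($bytes) = @_;
    $bytes =~ s/([^A-Za-z0-9\-._~\/:])/sprintf '%%%02X', ord $1/ge;
    return $bytes;
}

my $char = "\x{C7}";    # the character Ç (U+00C7)

# Same character, two different byte encodings.
my $latin1 = percent_encode( encode( 'iso-8859-1', $char ) );
my $utf8   = percent_encode( encode( 'UTF-8',      $char ) );

print "latin1: $latin1\n";    # one byte:  %C7
print "UTF-8:  $utf8\n";      # two bytes: %C3%87
```

If the undecoded latin1 byte reaches the URL escaping step, the link ends up with %C7. Once the content has been decoded to characters, the same step can produce %C3%87 instead.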
answered 2013-07-25T14:01:53.233