4

尝试使用 LWP::Useragent 和 Encode 进行字符编码从网页中拉回全局地址时,我在 perl 中遇到编码问题。我试过谷歌搜索解决方案,但似乎没有任何效果。我正在使用草莓 Perl 5.12.3。

以美国驻捷克大使馆地址页面为例(http://prague.usembassy.gov/contact.html)。我想要的只是拉回地址:

地址:Tržiště 15 118 01 Praha 1 - Malá Strana 捷克共和国

哪个 firefox 使用与网页标题字符集相同的字符编码 UTF-8 正确显示。但是,当我尝试使用 perl 将其拉回并将其写入文件时,尽管在 Useragent 或 Encode::decode 中使用了 decoded_content,但编码看起来还是一团糟。

我尝试在数据上使用正则表达式来检查错误不是在打印数据时(即在 perl 中内部正确),但错误似乎在于 perl 如何处理编码。

这是我的代码:

#!/usr/bin/perl

require Encode;
require LWP::UserAgent;
use utf8;

my $ua = LWP::UserAgent->new;
$ua->timeout(30);
$ua->env_proxy;

my $output_file;
$output_file = "C:/Documents and Settings/ian/Desktop/utf8test.txt";
open (OUTPUTFILE, ">$output_file") or die("Could not open output file $output_file: $!" );
binmode OUTPUTFILE, ":utf8";
binmode STDOUT, ":utf8";

# US embassy in Czech Republic webpage
$url = "http://prague.usembassy.gov/contact.html";

$ua_response = $ua->get($url);
if (!$ua_response->is_success) { die "Couldn't get data from $url";}

print 'CONTENT TYPE: '.$ua_response->content_charset."\n";
print OUTPUTFILE 'CONTENT TYPE: '.$ua_response->content_charset."\n";

my $content_not_decoded;
my $content_ua_decoded;
my $content_Endode_decoded;
my $content_double_decoded;

$ua_response->content =~ /<p><b>Address(.*?)<\/p>/;
$content_not_decoded = $1;
$ua_response->decoded_content =~ /<p><b>Address(.*?)<\/p>/;
$content_ua_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_Endode_decoded = $1;
Encode::decode_utf8($ua_response->content) =~ /<p><b>Address(.*?)<\/p>/;
$content_double_decoded = $1;

# get the content without decoding
print 'UNDECODED CONTENT:'.$content_not_decoded."\n";
print OUTPUTFILE 'UNDECODED CONTENT:'.$content_not_decoded."\n";

# print the decoded content
print 'DECODED CONTENT:'.$content_ua_decoded."\n";
print OUTPUTFILE 'DECODED CONTENT:'.$content_ua_decoded."\n";

# use Encode to decode the content
print 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";
print OUTPUTFILE 'ENCODE::DECODED CONTENT:'.$content_Endode_decoded."\n";

# try both!
print 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";
print OUTPUTFILE 'DOUBLE-DECODED CONTENT:'.$content_double_decoded."\n";

# check for #-digit character in the strings (to guard against the error coming in the print statement) 
if ($content_not_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN UNDECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_ua_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
    print OUTPUTFILE "AMPERSAND FOUND IN DECODED CONTENT- LIKELY ENCODING ERROR\n"; 
}
if ($content_Endode_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN ENCODE::DECODED CONTENT- LIKELY ENCODING ERROR\n";
}
if ($content_double_decoded =~ /\&/) {
    print "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
    print OUTPUTFILE "AMPERSAND FOUND IN DOUBLE-DECODED CONTENT- LIKELY ENCODING ERROR\n";
}

close (OUTPUTFILE);
exit;

这是终端的输出:

内容类型:UTF-8 未解码内容::
Tr├à┬╛išt├ä┬¢ 15
118 01 Praha 1 - Malá Strana
捷克共和国 解码内容::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
捷克共和国编码::解码内容::
Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá Strana
捷克共和国 双解码内容::Tr┼╛išt─¢ 15
118 01 Praha 1 - Malá StranaCzech Republic AMPERSAND 在未解码内容中发现-在解码内容中发现可能的编码错误和符号 - 在编码::解码内容中发现可能的编码错误和符号 - 在双重解码的内容中发现可能的编码错误和符号 - 可能是编码错误

和文件(注意这与终端略有不同但不正确)。OK WOW-这在堆栈溢出中显示为正确,但在 Bluefish、LibreOffice、Excel、Word 或我计算机上的任何其他内容中显示为正确。所以数据在那里只是被错误地编码。我真的不明白发生了什么事。

内容类型:UTF-8 未解码内容::
TržištÄ 15
118 01 Praha 1 - Malá Strana
Czech Republic DECODED CONTENT::
Tržiště 15
118 01 Praha 1 - Malá Strana
Czech Republic ENCODE::DECODED CONTENT::
Tržiště 15
118 01 Praha 1 - Malá Strana
捷克共和国双解码内容::Tržiště 15
118 01 Praha 1 - Malá StranaCzech Republic 在未解码的内容中发现 & 符号 - 可能在解码内容中发现 & 符号 - 可能是编码错误 & 在编码中发现::解码内容 - 可能是编码错误在双重解码内容中发现 & 符号 - 可能存在编码错误

任何如何做到这一点的指针都非常感激。

谢谢,伊恩/蒙特克里斯托

4

2 回答 2

5

错误是使用正则表达式来解析 HTML。至少,您缺乏对 HTML 实体的解码。您可以手动执行此操作,也可以将其留给强大的解析器:

use strictures;
use Web::Query 'wq';
use autodie qw(:all);

open my $output, '>:encoding(UTF-8)', '/tmp/embassy-prague.txt';
print {$output} wq('http://prague.usembassy.gov/contact.html')->find('p')->first->html; # or perhaps ->text
于 2012-06-27T08:01:48.210 回答
2
#!/usr/bin/env perl

use v5.12;
use strict;
use warnings;
use warnings qw(FATAL utf8);
use open     qw(:std :utf8);

use LWP::Simple;
use HTML::Entities;

my $content = get 'http://prague.usembassy.gov/contact.html';

my ($address) = ($content =~  m{<p><b>Address(.*?)</p>});
decode_entities($address);

say $address;

从命令行:

C:\temp> uu > tt.txt

C:\temp> gvim tt.txt

并以 GVim(UTF8 模式)显示以下文本:

</b>:<br />Tržiště 15<br />118 01 Praha 1 - Malá Strana<br />Czech Republic

另见Tom Christiansen 的标准序言

于 2012-06-27T07:40:50.907 回答