I'm trying to figure out why this doesn't work:
my $url = 'www880740.com';
use Mojo::UserAgent;
my $ua = Mojo::UserAgent->new->max_redirects(3);
$ua->transactor->name( "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; Gecko/2008052906 Firefox/3.0" );
my $tx = $ua->get(
    $url =>
    { 'Accept-Charset' => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7' }
);
my $page_title = $tx->result->dom->at( 'title' )->text;
print "GOT: $page_title \n";
foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana Inherited Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
    if ($page_title =~ /\p{$type}/) {
        print "$page_title seems to be $type!\n";
        last;
    }
}
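Basically, I want to take the page title from a URL and check whether it matches any of these Unicode scripts. I assume it fails because I need to decode the title into something the regex can match. When I slurp a curl'ed copy of the page into memory, it works fine. Devel::Peek::Dump gives me: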
SV = PV(0x55cd8264d650) at 0x55cd824c4b10
REFCNT = 1
FLAGS = (POK,IsCOW,pPOK)
PV = 0x55cd82655d80 "\301\371\272\317\264\253\306\34644181.com/\301\371\272\317\264\253\306\346\313\304\262\273\317\361/\302\355\273\341\277\252\275\261\275\341\271\373/\317\343\270\333\301\371\272\317\264\253\306\346/\302\355\273\341\277\252\275\261\274\307\302\274/\317\343\270\333\271\322\305\306|\310\374\302\355\273\341\327\312\301\317"\0
CUR = 91
LEN = 96
COW_REFCNT = 0
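Reading the dump: the flags list POK but not UTF8, so the scalar holds raw octets rather than decoded characters, and a regex treats each octet as its own code point. The high octets also come in pairs, which would be consistent with a double-byte encoding such as GBK, though that is a guess from the byte pattern and not something the dump states. A minimal sketch of the difference, assuming cp936 (Encode's name for GBK) is the right encoding and using only the first eight octets from the dump:

```perl
use strict;
use warnings;
use Encode qw(decode);

# First eight octets of the title, copied from the dump above.
my $bytes = "\301\371\272\317\264\253\306\346";

# Assumption: the page is GBK-encoded (cp936 in Encode's naming).
my $chars = decode('cp936', $bytes);

# On raw octets each byte is matched as a Latin-1 code point, so \p{Han} cannot match.
my $bytes_match = $bytes =~ /\p{Han}/ ? 1 : 0;

# After decoding, the regex sees the actual characters.
my $chars_match = $chars =~ /\p{Han}/ ? 1 : 0;

printf "bytes: length=%d Han=%d\n", length($bytes), $bytes_match;
printf "chars: length=%d Han=%d\n", length($chars), $chars_match;
```

If the assumption about the encoding holds, the eight octets collapse into four characters and only the decoded string matches.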
Update: I finally got this working:
my $page_title = $tx->result->dom->at( 'title' )->text;
use Encode;
use Encode::Detect;
use Encode::HanExtra;
$page_title = decode("Detect", $page_title);
print "GOT: $page_title \n";
foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana Inherited Kannada Katakana Khmer Lao Limbu Malayalam Mongolian Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
    if ($page_title =~ /\p{Script_Extensions=$type}/) {
        print "$page_title seems to be $type!\n";
        last;
    }
}
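This part:
decode("Detect", $page_title)
tries to detect the encoding of the title and then decodes it into Perl's internal character representation, which is what my regex needs in order to match. I tried to post my sample output, but for some reason it was flagged as spam.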
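Encode::Detect guesses the encoding, so it can guess wrong on short titles. If the likely encodings are known in advance, one alternative is to try strict decoding with each candidate in turn. The following is only a sketch of that idea using core Encode: the helper names decode_first and first_script are hypothetical, and the candidate list ('UTF-8', then 'cp936') is an assumption about this particular site. Single-byte encodings accept almost any input, so they should come last in such a list.

```perl
use strict;
use warnings;
use Encode qw(decode FB_CROAK LEAVE_SRC);

# Hypothetical helper: return the text decoded with the first candidate
# encoding that decodes without errors, or nothing if all of them fail.
sub decode_first {
    my ($bytes, @encodings) = @_;
    for my $enc (@encodings) {
        # FB_CROAK makes decode die on malformed input; LEAVE_SRC keeps $bytes intact.
        my $chars = eval { decode($enc, $bytes, FB_CROAK | LEAVE_SRC) };
        return $chars if defined $chars;
    }
    return;
}

# Hypothetical helper: return the first script name found in $text, if any.
sub first_script {
    my ($text, @scripts) = @_;
    for my $script (@scripts) {
        return $script if $text =~ /\p{Script_Extensions=$script}/;
    }
    return;
}

# First eight octets of the title from the dump; not valid UTF-8, so cp936 is tried next.
my $bytes  = "\301\371\272\317\264\253\306\346";
my $title  = decode_first($bytes, 'UTF-8', 'cp936');
my $script = first_script($title, qw/Arabic Cyrillic Greek Han Hangul Hiragana Katakana Thai/);

print "script: ", $script // 'unknown', "\n";
```

Trying UTF-8 first is a deliberate choice: strict UTF-8 decoding rejects most non-UTF-8 input, so a successful decode there is a fairly strong signal.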