I'm trying to figure out why this doesn't work:

my $url = 'www880740.com';

use Mojo::UserAgent;

my $ua = Mojo::UserAgent->new->max_redirects(3);
$ua->transactor->name( "Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; Gecko/2008052906 Firefox/3.0" );

my $tx = $ua->get(
    $url =>
    { 'Accept-Charset' => 'ISO-8859-1,utf-8;q=0.7,*;q=0.7' }
    );

my $page_title = $tx->result->dom->at( 'title' )->text;

print "GOT: $page_title \n";

foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana  Inherited Kannada Katakana Khmer Lao Limbu  Malayalam  Mongolian Myanmar Ogham Oriya  Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
    if ($page_title =~ /\p{$type}/) {
        print "$page_title seems to be $type!\n";
        last;
    }
}

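Basically, I want to take the title from a URL and check whether it matches any of these scripts. I assume it fails because I need to decode the title into something the regex can work with. When I slurp a copy of the page fetched with curl into memory, it works fine. Devel::Peek::Dump gives me: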

SV = PV(0x55cd8264d650) at 0x55cd824c4b10
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK)
  PV = 0x55cd82655d80 "\301\371\272\317\264\253\306\34644181.com/\301\371\272\317\264\253\306\346\313\304\262\273\317\361/\302\355\273\341\277\252\275\261\275\341\271\373/\317\343\270\333\301\371\272\317\264\253\306\346/\302\355\273\341\277\252\275\261\274\307\302\274/\317\343\270\333\271\322\305\306|\310\374\302\355\273\341\327\312\301\317"\0
  CUR = 91
  LEN = 96
  COW_REFCNT = 0

Update: I finally got this working:

my $page_title = $tx->result->dom->at( 'title' )->text;

use Encode;
use Encode::Detect;
use Encode::HanExtra;
my $page_title = decode("Detect", $page_title);
  
print "GOT: $page_title \n";

foreach my $type (qw/Arabic Armenian Bengali Bopomofo Braille Buhid Canadian_Aboriginal Cherokee Cyrillic Devanagari Ethiopic Georgian Greek Gujarati Gurmukhi Han Hangul Hanunoo Hebrew Hiragana  Inherited Kannada Katakana Khmer Lao Limbu  Malayalam  Mongolian Myanmar Ogham Oriya  Runic Sinhala Syriac Tagalog Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/) {
    if ($page_title =~ /\p{Script_Extensions=$type}/) {
        print "$page_title seems to be $type!\n";
        last;
    }
}

This bit:

my $page_title = decode("Detect", $page_title);

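attempts to detect the encoding and then converts the title to Perl's internal representation (ready for my regex to work). I tried to post my sample output, but for some reason it triggered the spam filter?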
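
For anyone who cannot install Encode::Detect, here is a rough sketch of a similar idea using only the core Encode module: try a short list of candidate encodings and keep the first one that decodes strictly. This is trial decoding, not statistical detection, so it is a different technique from what Encode::Detect does. The helper name `decode_first_match` and the candidate list are my own assumptions, and the sample bytes are the first eight bytes from the dump above.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: try each candidate encoding in order and return the
# decoded text plus the name of the first encoding that decodes without error.
sub decode_first_match {
    my ($bytes, @candidates) = @_;
    for my $enc (@candidates) {
        my $copy = $bytes;    # decode() with a CHECK value may modify its input
        my $text = eval { decode($enc, $copy, Encode::FB_CROAK()) };
        return ($text, $enc) if defined $text;
    }
    return;                   # nothing decoded cleanly
}

# First eight bytes of the title, copied from the Devel::Peek dump.
my $raw = "\301\371\272\317\264\253\306\346";

# Assumed candidate order: strict UTF-8 first, then GB2312 ("euc-cn" in Encode).
my ($title, $used) = decode_first_match($raw, qw(UTF-8 euc-cn));

print "decoded with: ", (defined $used ? $used : 'nothing'), "\n";
```

Order matters here: a single-byte encoding such as ISO-8859-1 accepts any byte sequence, so it should only ever be the last candidate.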

1 Answer

The title is in charset=gb2312, so it needs to be decoded into Perl's internal representation.
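
To see why the decoding step matters, here is a minimal sketch that needs no network access. It reuses the first eight bytes from the Devel::Peek dump in the question and assumes they are GB2312, which Encode calls euc-cn. The variable names are only for illustration.

```perl
use strict;
use warnings;
use Encode qw(decode);

# First eight bytes of the title, copied from the Devel::Peek dump.
my $raw = "\301\371\272\317\264\253\306\346";

# Undecoded, every byte counts as one character in the range U+0080..U+00FF,
# so the regex sees Latin-1 letters and symbols, never Han characters.
my $raw_is_han   = $raw =~ /\p{Han}/   ? 1 : 0;
my $raw_is_latin = $raw =~ /\p{Latin}/ ? 1 : 0;

# Assumption: the bytes are GB2312 ("euc-cn" is Encode's name for it).
my $text        = decode('euc-cn', $raw);
my $text_is_han = $text =~ /\p{Han}/ ? 1 : 0;

printf "raw:     %d chars, Han=%d, Latin=%d\n",
    length($raw), $raw_is_han, $raw_is_latin;
printf "decoded: %d chars, Han=%d\n", length($text), $text_is_han;
```

The raw string is twice as long as the decoded one because each GB2312 character here occupies two bytes.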

The following code demonstrates decoding this particular website's title and printing it to the console.

use strict;
use warnings;
use feature 'say';

use utf8;

use Mojo::UserAgent;
use Encode qw/encode decode/;

binmode STDOUT, ':encoding(UTF-8)';

my $url = 'www880740.com';
my $ua  = Mojo::UserAgent->new->max_redirects(3);

$ua->transactor->name( 'Mozilla/5.0 (Windows; U; Windows NT 5.1; pl; rv:1.9; Gecko/2008052906 Firefox/3.0' );

my $res = $ua->get( $url )->result;

# Encode's name for GB2312 is euc-cn
my $page_title = decode('euc-cn',$res->dom->at('title')->text);

say 'GOT: ' . $page_title;

exit;   # stops after printing the title; remove this line to run the script check below

my @langs = qw/Arabic Armenian Bengali Bopomofo Braille Buhid
               Canadian_Aboriginal Cherokee Cyrillic Devanagari
               Ethiopic Georgian Greek Gujarati Gurmukhi Han
               Hangul Hanunoo Hebrew Hiragana  Inherited Kannada
               Katakana Khmer Lao Limbu  Malayalam  Mongolian
               Myanmar Ogham Oriya Runic Sinhala Syriac Tagalog
               Tagbanwa TaiLe Tamil Telugu Thaana Thai Tibetan/;

for( @langs ) {
    say "$page_title matches $_!" if $page_title =~ /\p{$_}/;
}
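
The same check can be wrapped in a small function and tried on strings that are already decoded. This is only a sketch: `first_script` is a hypothetical helper name, the script list is shortened, and it assumes a Perl recent enough to support the Script_Extensions property that the question's update uses. The test strings are written with \x{...} escapes so the example does not depend on the source file's encoding.

```perl
use strict;
use warnings;

# Hypothetical helper: return the first script name whose Script_Extensions
# property matches at least one character of the (already decoded) text.
sub first_script {
    my ($text, @scripts) = @_;
    for my $script (@scripts) {
        return $script if $text =~ /\p{Script_Extensions=$script}/;
    }
    return;    # no listed script matched
}

my @scripts = qw(Arabic Cyrillic Greek Han Hangul Hebrew Thai);

my $han      = first_script("\x{516D}\x{5408}", @scripts);                          # two CJK ideographs
my $cyrillic = first_script("\x{41F}\x{440}\x{438}\x{432}\x{435}\x{442}", @scripts); # six Cyrillic letters
my $none     = first_script("ASCII only 123", @scripts);                            # Latin is not in the list

for my $result ($han, $cyrillic, $none) {
    print defined $result ? $result : 'no match', "\n";
}
```

Because the function returns on the first hit, the order of the script list decides the answer for titles that mix several scripts.
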

Answered 2020-11-02T09:52:24.970