
I've attached my Perl script below. I'm testing the number 1234 against an equivalent written in Japanese. (I copied it from Wikipedia, so it may not be 100% correct.)

Using

\p{decimal number}+
\p{Number}+
\d+

the code works for the ASCII version, but for Japanese the only example I could find is this:

[0-9\x{3041}-\x{3096}\x{30a1}-\x{30fc}\x{4e00}-\x{9faf}]

What am I doing wrong in this case?

use 5.016;

use utf8;
use charnames   qw< :full >;
use feature     qw< unicode_strings >;

use Test::More tests => 2;

sub is_valid {
  my $string = shift;

  $string ~~ /^[0-9\x{3041}-\x{3096}\x{30a1}-\x{30fc}\x{4e00}-\x{9faf}]+$/u

  #/\p{decimal number}+/msx
}

ok(is_valid("1234"), "ascii");
ok(is_valid("壱弐参四"), "japanese");

1 Answer


Your code passes for me on v5.14.

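The /u doesn't do what you think it's doing there, since you only have ASCII in the pattern. Your script requires v5.16 (use 5.016), whereas /u first showed up in v5.14. That's no big deal unless you're trying to use some v5.16 enhancement.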
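As a rough illustration of what /u does change (a sketch that is not taken from the question's script; the helper names are made up for this example): /u forces Unicode rules for the match, which mostly matters for characters in the range 128 to 255 held in a string that is not stored as UTF-8. The sketch assumes no `use v5.12` (or later) and no unicode_strings feature in scope, because those would turn the Unicode rules on by default.

```perl
use strict;
use warnings;

# Hypothetical helpers: the same pattern, without and with /u.
sub matches_word_default { my ($s) = @_; return $s =~ /\A\w\z/  ? 1 : 0 }
sub matches_word_unicode { my ($s) = @_; return $s =~ /\A\w\z/u ? 1 : 0 }

# "\xE9" (e with acute accent in Latin-1) is held as a single byte, so
# without /u Perl applies its native rules and \w does not match it.
my $e_acute = "\xE9";

printf "default rules: %d\n", matches_word_default($e_acute);
printf "with /u:       %d\n", matches_word_unicode($e_acute);
```

For plain ASCII input such as "a", both helpers agree, which is why adding /u makes no visible difference for a test string like "1234".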

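As many people have pointed out, there is a semantic difference between numbers and digits. I think you only want to match a run of digits. The problem is that the UCS doesn't mark the characters you want to match as digits.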
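A small check makes this concrete. The following is only a sketch (the classify helper is a name invented for this example): it tests a few characters against \p{Nd}, \p{N} and \p{Han}. The kanji numerals belong to the Han script but have the general category "Other Letter", so neither number property matches them.

```perl
use strict;
use warnings;
use utf8;
binmode STDOUT, ':encoding(UTF-8)';

# Hypothetical helper: report which properties a single character matches.
sub classify {
    my ($char) = @_;
    my %result = (
        decimal => ( $char =~ /\A\p{Nd}\z/  ? 1 : 0 ),  # decimal digits
        number  => ( $char =~ /\A\p{N}\z/   ? 1 : 0 ),  # any Number category
        han     => ( $char =~ /\A\p{Han}\z/ ? 1 : 0 ),  # Han script
    );
    return \%result;
}

# ASCII digit, fullwidth digit, and two kanji numerals from the question.
for my $char ( '1', '１', '四', '壱' ) {
    my $c = classify($char);
    printf "%s (U+%04X): Nd=%d N=%d Han=%d\n",
        $char, ord $char, @{$c}{qw(decimal number han)};
}
```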

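So you created a very broad character class to do that. I think you're stuck with that approach, but you probably don't want to keep typing it out. You could hide it all in a subroutine, but you can also define your own properties. You create a specially named subroutine (the name has to begin with In or Is) that returns a string containing lines of character ranges as hex values. Here's the example from perlunicode: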

sub InKana {
    return <<END;
3040\t309F
30A0\t30FF
END
}
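Once such a subroutine exists in the current package, the property can be used through \p{...} like a built-in one. A minimal sketch, repeating the InKana definition so that it runs on its own (is_kana is a name made up for this example):

```perl
use strict;
use warnings;
use utf8;

# User-defined property: each line is a range of hex code points.
sub InKana {
    return <<END;
3040\t309F
30A0\t30FF
END
}

# Hypothetical helper: true if the whole string consists of kana.
sub is_kana {
    my ($string) = @_;
    return $string =~ /\A\p{InKana}+\z/ ? 1 : 0;
}

print is_kana('ひらがな') ? "kana\n" : "not kana\n";
print is_kana('漢字')     ? "kana\n" : "not kana\n";
```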

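You can use the Unicode::Unihan module to find the code points you want. You do it through method calls, but each one is just a lookup in the Unihan database file that has the same name as the method. Someone who actually knows Japanese will have to adjust this to select the right characters: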

use v5.10;

use Number::Range;
use Unicode::Unihan;

my $db = Unicode::Unihan->new;
my $range = Number::Range->new;

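# Walk the code points, keeping Han characters that Unihan gives a numeric value.
foreach my $u ( 0 .. 0x01dfff ) {
    my $char = chr $u;
    next unless $char =~ /\p{Script: Han}/;
    # Take the first numeric field that is set. Note that || treats a
    # value of 0 as false, so a zero-valued character would be skipped.
    my $value = 
        $db->PrimaryNumeric( $char ) ||
        $db->AccountingNumeric( $char ) ||
        $db->OtherNumeric( $char )
        ;
    next unless defined $value;
    my $hex = sprintf "%X", $u;
    say chr($u), " (U+$hex) has numeric value: ", $value;
    $range->addrange( $u );  # remember the code point for the generated sub
    }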

my $sub = 
q(sub InJapaneseDigit {
    return <<'HERE';
)

.

join( "\n", 
    map { 
        join "\t", 
            map { sprintf "%X", $_ } 
            split /\.\./;  
        } 
    split /,/, $range->range 
    )

.

qq(\nHERE\n});

say $sub;

This program outputs:

㐅 (U+3405) has numeric value: 5
㒃 (U+3483) has numeric value: 2
㠪 (U+382A) has numeric value: 5
㭍 (U+3B4D) has numeric value: 7
一 (U+4E00) has numeric value: 1
七 (U+4E03) has numeric value: 7
万 (U+4E07) has numeric value: 10000
三 (U+4E09) has numeric value: 3
九 (U+4E5D) has numeric value: 9
二 (U+4E8C) has numeric value: 2
五 (U+4E94) has numeric value: 5
亖 (U+4E96) has numeric value: 4
亿 (U+4EBF) has numeric value: 100000000
什 (U+4EC0) has numeric value: 10
仟 (U+4EDF) has numeric value: 1000
仨 (U+4EE8) has numeric value: 3
伍 (U+4F0D) has numeric value: 5
佰 (U+4F70) has numeric value: 100
億 (U+5104) has numeric value: 100000000
兆 (U+5146) has numeric value: 1000000000000
兩 (U+5169) has numeric value: 2
八 (U+516B) has numeric value: 8
六 (U+516D) has numeric value: 6
十 (U+5341) has numeric value: 10
千 (U+5343) has numeric value: 1000
卄 (U+5344) has numeric value: 20
卅 (U+5345) has numeric value: 30
卌 (U+534C) has numeric value: 40
叁 (U+53C1) has numeric value: 3
参 (U+53C2) has numeric value: 3
參 (U+53C3) has numeric value: 3
叄 (U+53C4) has numeric value: 3
四 (U+56DB) has numeric value: 4
壱 (U+58F1) has numeric value: 1
壹 (U+58F9) has numeric value: 1
幺 (U+5E7A) has numeric value: 1
廾 (U+5EFE) has numeric value: 9
廿 (U+5EFF) has numeric value: 20
弌 (U+5F0C) has numeric value: 1
弍 (U+5F0D) has numeric value: 2
弎 (U+5F0E) has numeric value: 3
弐 (U+5F10) has numeric value: 2
拾 (U+62FE) has numeric value: 10
捌 (U+634C) has numeric value: 8
柒 (U+67D2) has numeric value: 7
漆 (U+6F06) has numeric value: 7
玖 (U+7396) has numeric value: 9
百 (U+767E) has numeric value: 100
肆 (U+8086) has numeric value: 4
萬 (U+842C) has numeric value: 10000
貮 (U+8CAE) has numeric value: 2
貳 (U+8CB3) has numeric value: 2
贰 (U+8D30) has numeric value: 2
阡 (U+9621) has numeric value: 1000
陆 (U+9646) has numeric value: 6
陌 (U+964C) has numeric value: 100
陸 (U+9678) has numeric value: 6

sub InJapaneseDigit {
        return <<'HERE';
3405
3483
382A
3B4D
4E00
4E03
4E07
4E09
4E5D
4E8C
4E94
4E96
4EBF    4EC0
4EDF
4EE8
4F0D
4F70
5104
5146
5169
516B
516D
5341
5343    5345
534C
53C1    53C4
56DB
58F1
58F9
5E7A
5EFE    5EFF
5F0C    5F0E
5F10
62FE
634C
67D2
6F06
7396
767E
8086
842C
8CAE
8CB3
8D30
9621
9646
964C
9678
HERE
}
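To tie this back to the question, the generated property can replace the broad character class in is_valid. The following is only a sketch: the property body is deliberately abbreviated to a dozen of the code points listed above (the everyday numerals 一 to 九 plus 壱, 弐 and 参), and whether those are the right characters for your data is an assumption you would need to check.

```perl
use strict;
use warnings;
use utf8;

# Abbreviated for illustration: a subset of the generated list.
sub InJapaneseDigit {
    return <<'HERE';
4E00
4E03
4E09
4E5D
4E8C
4E94
516B
516D
53C2
56DB
58F1
5F10
HERE
}

# Same shape as the question's is_valid, using the custom property.
sub is_valid {
    my ($string) = @_;
    return $string =~ /\A[0-9\p{InJapaneseDigit}]+\z/ ? 1 : 0;
}

print is_valid('1234')     ? "ok\n" : "not ok\n";
print is_valid('壱弐参四') ? "ok\n" : "not ok\n";
```

Like the original character class, this accepts ASCII and kanji numerals mixed in one string; use an alternation of two separate runs if that should be rejected.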
answered 2013-01-31T21:17:31.153