perl - Normalizing ASCII characters

Question

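I need to normalize strings such as "quée", but I can't seem to convert extended ASCII characters (é, á, í, and so on) into their Roman/English equivalents. I've tried several different approaches, and so far none of them has worked. There is a fair amount of material on this general topic, but I can't find a working answer to this specific problem.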

Here is my code:

#transliteration solution (works great with standard chars but doesn't find the 
#special ones) - I've tried looking for both \x{130} and é with the same result.
$mystring =~ tr/\\x{130}/e/;

#converting into array, then iterating through and replacing the specific char
#( same result as the above solution )
my @breakdown = split( "",$mystring );

foreach ( @breakdown ) {
    if ( $_ eq "\x{130}" ) {
        $_ = "e";
        print "\nArray Output: @breakdown\n";
    }
    $lowercase = join( "",@breakdown );
}

score 9 · Accepted Answer

1) This article should provide a fairly good (if complicated) approach.

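It offers a solution for converting every accented Unicode character into a base character plus a separate accent; once that is done, you can simply remove the accent characters on their own.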

2) Another option is CPAN: Text::Unaccent::PurePerl (an improved pure-Perl version of Text::Unaccent).

3) In addition, this SO answer suggests Text::Unidecode:

$ perl -Mutf8 -MText::Unidecode -E 'say unidecode("été")'
  ete
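
For option 1, the decompose-then-strip idea can be sketched with the core Unicode::Normalize module. This is a minimal sketch rather than the linked article's code, and the helper name strip_accents is hypothetical:

```perl
use strict;
use warnings;
use utf8;                          # the string literals below are UTF-8 source text
use Unicode::Normalize qw(NFD);    # core module; NFD = canonical decomposition

# Hypothetical helper: decompose each character into base + combining marks,
# then delete the combining marks (Unicode property \p{Mn}, nonspacing mark).
sub strip_accents {
    my ($text) = @_;
    my $decomposed = NFD($text);
    $decomposed =~ s/\p{Mn}//g;
    return $decomposed;
}

binmode STDOUT, ':encoding(UTF-8)';
print strip_accents("été"), "\n";      # prints "ete"
print strip_accents("señor"), "\n";    # prints "senor"
```

Unlike Text::Unidecode, this only handles characters that have a canonical decomposition; letters such as ø or ß pass through unchanged.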

score 7 · Accepted Answer

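The reason your original code doesn't work is that \x{130} is not é. It is LATIN CAPITAL LETTER I WITH DOT ABOVE (U+0130, İ). You meant \x{E9} or just \xE9 (the braces are optional for two hex digits), which is LATIN SMALL LETTER E WITH ACUTE (U+00E9).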

Also, you have an extra backslash in your tr. It should look like tr/\xE9/e/.

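With those changes your code will work, although I'd still recommend using one of the modules on CPAN for this sort of thing. I prefer Text::Unidecode myself, because it handles much more than just accented characters.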
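
As a sketch of the corrected transliteration, extended to a handful of other Latin-1 letters (the particular character list is an assumption chosen for illustration, not a complete mapping):

```perl
use strict;
use warnings;

# \xE1 \xE9 \xED \xF3 \xFA \xF1 are á é í ó ú ñ, in the same order as the
# replacement list. Note the single backslash in front of each escape.
my $mystring = "qu\x{E9}e";
( my $plain = $mystring ) =~ tr/\xE1\xE9\xED\xF3\xFA\xF1/aeioun/;

print "$plain\n";    # prints "quee"
```

This only matches when $mystring holds decoded characters. If the text was read from a UTF-8 file as raw bytes, é arrives as two bytes and needs decoding before tr can find it.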

score 3 · Accepted Answer

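After working and reworking, this is what I have now. It does everything I want, except that I would like to keep the spaces in the middle of the input string so that words stay separated.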

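use strict;
use warnings;
use Unicode::Normalize;    # provides NFD()

# Decode the file as UTF-8 so each accented letter arrives as one character
open my $fh, '<:encoding(UTF-8)', 'funnywords.txt'
    or die "Cannot open funnywords.txt: $!";
binmode STDOUT, ':encoding(UTF-8)';

# Iterate through funnywords.txt
while ( <$fh> ) {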
    chomp;

    # Show initial text from file
    print "In: '$_' -> ";

    my $inputString = $_;

    # The for loop aliases $_ to $inputString. It decomposes
    # unicode characters ( example: "é" splits into "e" and "´" )
    # and throws away accent marks, then deletes all non-word
    # characters. \W also matches spaces, so the spaces between
    # words are removed as well.
    for ( $inputString ) {
        $inputString = NFD( $inputString ); # decompose/dissect
        s/^\s+//; s/\s+$//;                 # strip begin/end spaces
        s/\pM//g;                           # strip combining marks
        s/\W+//g;                           # strip non-word chars
    }

    # Convert to lowercase 
    my $outputString = "\L$inputString";

    # Output final result
    print "$outputString\n";
}

Not entirely sure why it is coloring some of the regexes and comments red...

Here are some sample lines from "funnywords.txt":

quée

22.

?éÉíóñúÑ¿¡

[ .this? ]

aquí, allí

score 2 · Accepted Answer

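For your second question, about getting rid of any remaining symbols while keeping letters and digits, change your last regex from s/\W+//g to s/[^a-zA-Z0-9 ]+//g. Since you have already normalized the rest of the input, that regex will remove anything that is not a-z, A-Z, 0-9, or a space. Using [] with a ^ at the start means you are matching everything that is not listed in the rest of the brackets.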
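
Putting that change together with the earlier decomposition step, a minimal sketch might look like the following. The helper name normalize_line is hypothetical, and the sample inputs are only illustrations:

```perl
use strict;
use warnings;
use utf8;                          # the string literals below are UTF-8 source text
use Unicode::Normalize qw(NFD);

# Hypothetical helper combining the steps discussed in this thread.
sub normalize_line {
    my ($line) = @_;
    $line = NFD($line);                 # split letters from their accents
    $line =~ s/\pM//g;                  # drop the combining marks
    $line =~ s/[^a-zA-Z0-9 ]+//g;       # keep only ASCII letters, digits, spaces
    $line =~ s/^\s+|\s+$//g;            # trim leading/trailing whitespace
    return lc $line;
}

binmode STDOUT, ':encoding(UTF-8)';
print normalize_line("Café au lait!"), "\n";    # prints "cafe au lait"
print normalize_line("?éÉíóñúÑ¿¡"), "\n";       # prints "eeiounun"
```

The trim runs after the character class substitution so that spaces left behind by removed punctuation, as in "[ .this? ]", are also cleaned up.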
