string - Perl: comparing 2 accented strings with different encodings (one read from a UTF-8 file)

Question

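I have been fighting with this for more than a day and have run many Google searches on the problem, without any result. :(

I have the following code, which reads a UTF-8 encoded text file containing a list of names; my Perl script should stop when it finds a specific name. The names are French and often carry accents. That is where it starts to behave unexpectedly:

So here is the code: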

#!/usr/bin/perl
$ErrorWordFile = "./myFile.txt";
open FILEcorpus, $ErrorWordFile or die $!;

while (<FILEcorpus>)
{
    chomp;
    $_ =~ s/\r|\n//g;
    $normWord = $_;
    $string = "stéphane";

    if ( $normWord eq $string )
    {
        print "\nYES!! does work";
    }
    else
    {
        print "\nNO does NOT work";
    }
}

close(FILEcorpus);

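The corpus file (./myFile.txt) contains "stéphane\n" as its only content.

The problem obviously comes from the file's UTF-8 encoding and the accents, but apparently the fix is not that easy. I tried many things, including

use utf8

or

utf8::decode($normWord);

without any success :(

Any ideas???

Many thanks for your precious help!

Simon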

score 3 · Accepted Answer

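You are currently trying to compare two byte strings that may not be normalized.

1: use utf8 changes the string literals in your program from byte strings to Unicode (character) strings.

2: Open the file with the '<:utf8' layer so that the input is understood (decoded) as Unicode.

3: use Unicode::Normalize to convert both strings to the same normalization form.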
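The effect of the normalization step above can be sketched as follows. This is a minimal illustration, not the asker's script: the two sample strings and the helper name same_text are made up for the example, and NFC is an arbitrary choice (NFD works equally well as long as both sides use the same form).

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                               # this source file is UTF-8
use Unicode::Normalize qw(NFC);

# Hypothetical helper: compare two character strings after normalizing both.
sub same_text {
    my ($left, $right) = @_;
    return NFC($left) eq NFC($right);
}

# The same visible word, spelled with two different code point sequences.
my $precomposed = "st\x{E9}phane";      # accented e as the single code point U+00E9
my $decomposed  = "ste\x{301}phane";    # plain e followed by U+0301 COMBINING ACUTE ACCENT

# A plain eq compares code points one by one, so these two differ.
print "plain eq:   ", ($precomposed eq $decomposed ? "equal" : "different"), "\n";

# After normalization both sides use the same representation.
print "normalized: ", (same_text($precomposed, $decomposed) ? "equal" : "different"), "\n";
```

Both strings render identically on screen, which is why this kind of mismatch is hard to spot by looking at the file.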

score 3 · Accepted Answer

Try this.

#!/usr/bin/perl
use strict;
use warnings;
use utf8;  # This is needed because of the literal "stéphane" in the below code

my $ErrorWordFile = "./myFile.txt";
open my $FILEcorpus, '<:utf8', $ErrorWordFile or die $!;

while ( my $normWord = <$FILEcorpus> ) {
    chomp $normWord;
    $normWord =~ s/\r|\n//g;
    my $string = "stéphane";

    if ( $normWord eq $string ) {
        print "YES!! does work\n";
    }
    else {
        print "NO does NOT work\n";
    }
}

close $FILEcorpus;

You need to tell Perl that the file you are reading is UTF-8 and that the string you are comparing against is UTF-8.
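To see why the decoding step matters, here is a small sketch that simulates reading the word without an I/O layer by encoding the literal back to UTF-8 bytes with the core Encode module. The variable names are illustrative only, and the "raw bytes" value is an assumption about what an undecoded read would return.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use utf8;                                    # the literal below is decoded to characters
use Encode qw(encode decode);

my $literal = "stéphane";                    # a character string thanks to use utf8

# Assumption for illustration: this stands in for a line read without '<:utf8'.
my $raw_bytes = encode('UTF-8', $literal);   # the accented letter takes two bytes here

print "characters in literal: ", length($literal), "\n";
print "bytes in raw input:    ", length($raw_bytes), "\n";

# Bytes versus characters: the comparison fails.
print "raw eq literal:     ", ($raw_bytes eq $literal ? "equal" : "different"), "\n";

# Decoding the bytes first makes both sides character strings.
my $decoded = decode('UTF-8', $raw_bytes);
print "decoded eq literal: ", ($decoded eq $literal ? "equal" : "different"), "\n";
```

Opening the file with '<:utf8' performs this decoding on every read. The '<:encoding(UTF-8)' layer does the same and additionally validates the input, so it is often the safer choice.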

score 0 · Accepted Answer

Many thanks for your explanation. The answer provided by Tjd works fine and helps me a lot (I had been fighting with this problem for days already!!)

So here is the modified code according to your comments:

#!/usr/bin/perl

use utf8; #ADDED
use Unicode::Normalize; #ADDED

$ErrorWordFile = "./myFile.txt";
open FILEcorpus, '<:utf8', $ErrorWordFile or die $!; #CHANGED

while (<FILEcorpus>)
{
    chomp;
    $_ =~ s/\r|\n//g;
    $normWord = $_;
    $string = "stéphane";

    # Bring both strings to the same normalization form (NFD) before comparing
    $NFD_string   = Unicode::Normalize::NFD($string);   #ADDED
    $NFD_normWord = Unicode::Normalize::NFD($normWord); #ADDED

    if ( $NFD_normWord eq $NFD_string )
    {
        print "\nYES!! does work";
    }
    else
    {
        print "\nNO does NOT work";
    }
}

close(FILEcorpus);

so THANKS a lot!!

Sb
