regex - Perl's split() on tabs fails when the string contains non-Latin characters

Question

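I'm modifying a Perl script that reads a series of UCS-2LE encoded files containing tab-delimited strings, but I can't split the strings on tabs when they contain characters outside the extended Latin character set.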

Here is a sample line that I'm reading from these files (tab-delimited):

adını   transcript  asr turkish

When I have the script write these lines to an output file to debug the problem, this is what it writes:

ad1Ů1ĉtranscript    asr turkish

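It seems that it isn't recognizing the tab character after the Turkish character. This only happens when a word ends with a non-Latin character (and is therefore adjacent to the tab).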

Here is the part of the code where the string is split and where the output file is written:

for my $infile (@ARGV){  
    if (!open (INFILE, "<$infile")){
        die "Couldn't open $infile.\n";
    }    

binmode (OUTFILE, ":utf8");

while (<INFILE>) {
    chomp;
    $tTot++;

    if ($lineNo == 1) {                
        $_ = decode('UCS-2LE', $_);      
    }
    else {
        $_ = decode('UCS-2', $_);
    }    

    $_ =~ s/[\r\n]+//g;    
    my @foo = split('\t');

    my $orth = $foo[0];
    my $tscrpt = $foo[1];
    my $langCode = $foo[3];

    if (exists $codeHash{$langCode}) {
      unless ($tscrpt eq '') {
        check($orth, $tscrpt, $langCode);
      }
    }
    else {
        print OUTFILE "Unknown language code $langCode at line $lineNo.\n";
        print OUTFILE $_; # printing the string that's not being split correctly
        print OUTFILE "\n";
        $tBad++;
    }
  }

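The purpose of the script is to check whether the language code on each line of the input file is valid and, based on that code, whether each word's transcription is "legal" according to our transcription system.

Here is what I have tried so far:

Changing the encoding of the input string to UTF-8, UTF-16, or UTF-16LE
Changing the split() pattern to '\w', /[[:blank:]]/, \p{Blank}, \x{09}, and \N{U+0009}
Reading the Perl Unicode and perlrebackslash documentation, as well as any other remotely related posts I could find on various sites

Does anyone have suggestions for other things I might try? Thanks in advance!

I should also mention that I have no control over the input or output file encodings; I must read UCS-2LE and write UTF-8.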

score 1 · Accepted Answer

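You should open the file with the correct encoding in the first place (I don't know whether this is the right one, but I'll take your word for it). Then you don't need to call decode() at all: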

open(my $fh, "<:encoding(UCS-2LE)", $file) or die "Error opening $file: $!";
while (<$fh>) {
  ...
}
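
To see why decoding line by line goes wrong, here is a minimal sketch. It assumes the original script reads raw bytes, so each line ends at the 0x0A byte and the 0x00 that completes the little-endian newline is left at the start of the next line. It also assumes that Encode treats plain 'UCS-2' as big-endian, which is why the sketch uses 'UCS-2BE' explicitly. The variable names and the simulated stray byte are illustrative, not taken from the question's code.

```perl
use strict;
use warnings;
use Encode qw(encode decode);

# First two fields of the sample line from the question, as UCS-2LE bytes.
my $line  = "ad\x{131}n\x{131}\ttranscript";
my $bytes = encode('UCS-2LE', $line);

# Assumption: a byte-oriented readline stops at 0x0A, so the 0x00 that
# completes the previous line's newline (0A 00) ends up in front of this line.
# Prepend that stray byte and drop the now-unpaired final byte.
my $shifted = "\x00" . substr($bytes, 0, -1);

# Decoding the shifted bytes as big-endian pairs up the wrong bytes:
# the tab (09 00) is merged with the high byte of the preceding U+0131.
my $wrong        = decode('UCS-2BE', $shifted);
my @wrong_fields = split /\t/, $wrong;

# Decoding the untouched bytes as little-endian keeps the tab intact.
my $right        = decode('UCS-2LE', $bytes);
my @right_fields = split /\t/, $right;

binmode STDOUT, ':encoding(UTF-8)';
printf "shifted: %d field(s): %s\n", scalar @wrong_fields, $wrong;
printf "correct: %d field(s): %s\n", scalar @right_fields, join ' | ', @right_fields;
```

The shifted decode produces the same garbled prefix shown in the question, with no tab left to split on. Characters in the basic Latin range survive the shift because their high byte is 0x00, which is why only words ending in a character such as U+0131 lose their tab.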

score 0 · Accepted Answer

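Thanks to everyone's comments and some further research, I figured out how to solve this, and the cause was slightly different from what I had thought: it turned out to be a combination of the split() problem and an encoding problem. I had to add the encoding to an explicit open statement instead of relying on the implicit open in the for loop, and I also had to skip the first two bytes at the start of the file.

Here is the corrected, working code for the section I posted in the question: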

for my $infile (@ARGV){
    my $outfile = $infile . '.out';

    # SOLUTION part 1: added explicit open statement
    open (INFILE, "<:raw:encoding(UCS-2le):crlf", $infile) or die "Error opening $infile: $!";

    # SOLUTION part 2: had to skip the first two bytes of the file 
    seek INFILE, 2, 0;

    if (!open (OUTFILE, ">$outfile")) {
        die "Couldn't write to $outfile.\n";
    }

    binmode (OUTFILE, ":utf8");
    print OUTFILE "Line#\tOriginal_Entry\tLangCode\tOffending_Char(s)\n";

    $tBad = 0;
    $tTot = 0;
    $lineNo = 1;

while (<INFILE>) {
    chomp;
    $tTot++;

    # SOLUTION part 3: deleted the "if" block I had here before that was handling encoding

    # Rest of code in the original block is the same    
}

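My code now correctly recognizes tab characters adjacent to characters outside the extended Latin set, and splits on the tabs as it should.

Note: another workaround is to enclose the foreign words in double quotes, but in our case we can't guarantee that our input files will be formatted that way.

Thanks to everyone who commented and helped me out!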
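
For anyone who wants to try the same layers without the original data, here is a self-contained sketch. It assumes the two skipped bytes are a byte order mark (FF FE), writes a small made-up UCS-2LE sample to a temporary file, and strips the BOM from the first line with a substitution rather than with seek. The sample rows and variable names are hypothetical.

```perl
use strict;
use warnings;
use Encode qw(encode);
use File::Temp qw(tempfile);

# Build a small UCS-2LE sample: a BOM followed by two CRLF-terminated rows.
my $sample = "\x{FEFF}ad\x{131}n\x{131}\ttranscript\tasr\tturkish\r\n"
           . "word\ttranscript\tasr\tenglish\r\n";
my $bytes  = encode('UCS-2LE', $sample);

# Write the raw bytes to a temporary file that is removed on exit.
my ($tmp_fh, $path) = tempfile(UNLINK => 1);
binmode $tmp_fh;
print {$tmp_fh} $bytes;
close $tmp_fh or die "Error closing $path: $!";

# Same layers as the corrected code: decode first, then translate CRLF.
open(my $in, '<:raw:encoding(UCS-2LE):crlf', $path)
    or die "Error opening $path: $!";

my @rows;
while (my $line = <$in>) {
    $line =~ s/^\x{FEFF}// if $. == 1;   # drop the BOM on the first line
    chomp $line;
    push @rows, [ split /\t/, $line ];
}
close $in;

printf "%d fields, language code: %s\n", scalar @{ $rows[0] }, $rows[0][3];
```

Because the encoding layer decodes before lines are split, the newline and tab are matched as characters rather than as single bytes, so no separate decode() call is needed inside the loop.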
