regex - Perl: replacing Japanese file names with English ones

Question

I put together a Perl script that replaces Japanese file names with English ones. There are still a few things I don't quite understand.

I have the following configuration. Client OS:

Windows XP Japanese

Notepad++ installed

Server:

Red Hat Enterprise Linux Server release 6.2

Perl v5.10.1

VIM: VIM version 7.2.411

Xterm: ASTEC-X version 6.0

CSH: tcsh 6.17.00 (Astron)

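The file names come from a Japanese .csv file generated on Windows. I have seen posts about using utf8 and encoding conversion in Perl, and I would like to better understand why I did not need any of what those other threads mention.

Here is my script, which works. My questions follow.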

#!/usr/bin/perl
my $work_dir = "/nas1_home4/fsomeguy/someplace";
opendir(DIR, $work_dir) or die "Cannot open directory";
my @files = readdir(DIR);
foreach (@files) 
{
    my $original_file = $_; 
    s/機/-machine_/; # replace 機 with -machine_
    my $new_file = $_;
    if ($new_file ne $original_file)
    {
        print "Rename " . $original_file . " to " . $new_file . "\n";
        rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or  print "Warning: rename failed because: $!\n";
    }
}

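Questions:

1) Why is utf8 not needed in this example? In what kind of example would I need it? `use utf8;` has been discussed elsewhere (see: use utf8 gives me "Wide character in print"). But if I add `use utf8`, this script stops working.

2) Why are no encoding operations needed in this example?
I actually wrote the script on Windows with Notepad++, pasting the Japanese characters from Explorer on Windows XP Japanese into my script. In Xterm and VIM the characters show up as garbage. Yet I did not have to do any of the encoding handling discussed in "How can I convert Japanese characters to unicode in Perl?".

Thanks.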

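Update 1

Testing a simple localization example in Perl that replaces Japanese in both a file name and the file's text

On Windows XP, I copied the 南 character from the .csv data file to the clipboard, then used it as both the file name (i.e. 南.txt) and the file content (南). In Notepad++, reading the file as UTF-8 shows x93xEC; reading it as SHIFT_JIS shows 南.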
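
The observation above can be reproduced without any files. This is a minimal sketch, assuming the core Encode module is available; the helper name hex_bytes is made up for illustration. It prints the bytes that encode 南 (U+5357) in Shift_JIS and in UTF-8, which is why a viewer expecting UTF-8 cannot make sense of x93xEC.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(encode);

# U+5357 is the code point of 南; written as an escape so this file stays ASCII.
my $south = "\x{5357}";

# Hypothetical helper: format each byte of an encoded string as xNN.
sub hex_bytes {
    my ($bytes) = @_;
    return join '', map { sprintf 'x%02X', ord } split //, $bytes;
}

my $sjis = hex_bytes(encode('shift_jis', $south));
my $utf8 = hex_bytes(encode('UTF-8',     $south));

print "shift_jis: $sjis\n";   # x93xEC
print "UTF-8:     $utf8\n";   # xE5x8Dx97
```

The two byte sequences have nothing in common, so text saved in one encoding has to be decoded with that same encoding before a character-level pattern can match it.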

Script:

The following Perl script, south.pl, runs on the Linux server with Perl 5.10.

#!/usr/bin/perl
use feature qw(say);

use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

my $user_dir="/usr/frank";
my $work_dir = "${user_dir}/test_south";

# forward declare the function prototypes
sub fileProcess;

opendir(DIR, ${work_dir}) or die "Cannot open directory " . ${work_dir};

# readdir OPTION 1 - shift_jis
#my @files = map { Encode::decode("shift_jis", $_); } readdir DIR; # Note filename    could not be decoded as shift_jis
#binmode(STDOUT,":encoding(shift_jis)");                    

# readdir OPTION 2 - utf8
my @files = map { Encode::decode("utf8", $_); } readdir DIR; # Note filename could be decoded as utf8
binmode(STDOUT,":encoding(utf8)");                           # setting display to output utf8

say @files;                                 

# rename the files, then process the file contents (both subs use the file-level @files and $work_dir)
fileNameTranslate();
fileProcess();

closedir(DIR);

exit;

sub fileNameTranslate
{

    foreach (@files) 
    {
        my $original_file = $_; 
        #print "original_file: " . "$original_file" . "\n";     
        s/南/south/;     

        my $new_file = $_;
        # print "new_file: " . "$_" . "\n";

        if ($new_file ne $original_file)
        {
            print "Rename " . $original_file . " to \n\t" . $new_file . "\n";
            rename("${work_dir}/${original_file}", "${work_dir}/${new_file}") or print "Warning: rename failed because: $!\n";
        }
    }
}

sub fileProcess
{

    #   file process OPTION 3, open file as shift_jis, the search and replace would work
    #   open (IN1,  "<:encoding(shift_jis)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    #   open (OUT1, "+>:encoding(shift_jis)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";  

    #   file process OPTION 4, open file as utf8, the search and replace would not work
    open (IN1,  "<:encoding(utf8)", "${work_dir}/south.txt") or die "Error: south.txt\n";
    open (OUT1, "+>:encoding(utf8)" , "${work_dir}/south1.txt") or die "Error: south1.txt\n";   

    while (<IN1>)
    {
        print $_ . "\n";
        chomp;

        s/南/south/g;


        print OUT1 "$_\n";
    }

    close IN1;
    close OUT1; 
}

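Results:

(BAD) Uncomment options 1 and 3 (comment out options 2 and 4). Setup: readdir encoding SHIFT_JIS; file open encoding SHIFT_JIS. Result: file name replacement failed. Error: utf8 "\x93" does not map to Unicode at .//south.pl line 68. \x93

(BAD) Uncomment options 2 and 4 (comment out options 1 and 3). Setup: readdir encoding utf8; file open encoding utf8. Result: file name replacement succeeded and south.txt was generated, but the content replacement in south1.txt failed; it has the content \x93 (). Error: "\x{fffd}" does not map to shiftjis at .//south.pl line 25. ... -Ao?= (Bx{fffd}.txt

(GOOD) Uncomment options 2 and 3 (comment out options 1 and 4). Setup: readdir encoding utf8; file open encoding SHIFT_JIS. Result: file name replacement succeeded and south.txt was generated; the content replacement in south1.txt succeeded, with the content south.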

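Conclusion:

To make this example work I had to use two different encodings: utf8 for readdir and SHIFT_JIS for file processing, because the content of the csv file is SHIFT_JIS encoded.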
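
The conclusion can be sketched without touching the disk. The byte strings below are stand-ins (assumptions) for what readdir and the .csv file would return on this setup; each is decoded with its own encoding, after which one character-level pattern matches both. The pattern uses the escape \x{5357} for 南 so the sketch does not depend on how the script file itself is saved.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode encode);

# Assumed inputs: the file name as UTF-8 bytes (as readdir would return it),
# the file content as Shift_JIS bytes (as written by the Windows .csv export).
my $name_bytes    = "\xE5\x8D\x97.txt";
my $content_bytes = "\x93\xEC";

# Decode each with its own encoding so both become character strings.
my $name    = decode('UTF-8',     $name_bytes);
my $content = decode('shift_jis', $content_bytes);

# U+5357 is 南; the same character-level pattern now matches in both places.
my $south = "\x{5357}";
(my $new_name    = $name)    =~ s/$south/south/;
(my $new_content = $content) =~ s/$south/south/g;

# Encode again on the way out, each with the encoding its destination expects.
print encode('UTF-8',     $new_name),    "\n";
print encode('shift_jis', $new_content), "\n";
```

In the script above, the `:encoding(...)` layers on open and the decode call in the readdir map do the same decoding step; the point is only that the two sources need different encoding names.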

score 2 · Accepted Answer

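Your script is completely unaware of Unicode. It treats every string as a sequence of bytes. Luckily, the bytes that encode the file names are the same as the bytes that encode the Japanese characters in your source. If you tell Perl to `use utf8`, it interprets the Japanese characters in your script as characters, but not the ones coming from the file system, so there will be no match.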
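
A minimal sketch of that mismatch, assuming the file system hands back UTF-8 bytes and the script is saved as UTF-8 (neither is confirmed in the question). The two pattern variables are illustrative: one holds what the literal 機 amounts to without `use utf8`, the other what it becomes with it.

```perl
#!/usr/bin/perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption: readdir returns this name as UTF-8 bytes (機.txt).
my $from_readdir = "\xE6\xA9\x9F.txt";

# Without "use utf8", a literal 機 saved as UTF-8 is just these three bytes.
my $byte_pattern = "\xE6\xA9\x9F";
# With "use utf8", the same literal becomes the single character U+6A5F.
my $char_pattern = "\x{6A5F}";

my $bytes_vs_bytes = $from_readdir =~ /\Q$byte_pattern\E/ ? 1 : 0;
my $chars_vs_bytes = $from_readdir =~ /\Q$char_pattern\E/ ? 1 : 0;
my $chars_vs_chars = decode('UTF-8', $from_readdir) =~ /\Q$char_pattern\E/ ? 1 : 0;

print "bytes pattern vs raw name:      $bytes_vs_bytes\n";   # 1
print "character pattern vs raw name:  $chars_vs_bytes\n";   # 0
print "character pattern vs decoded:   $chars_vs_chars\n";   # 1
```

So both sides have to agree: either everything stays bytes, as in the original script, or the script uses `use utf8` and also decodes what readdir returns.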
