perl - Multilingual text sorting in Perl, on Windows, using locales

Question

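I am building software that sorts book indexes in different languages. It is written in Perl and keys off the locale. I am developing it on Unix, but it needs to be portable to Windows. Should this work in principle, or am I barking up the wrong tree by relying on locales? Bottom line: Windows is where I really need it to work, but I prefer to develop in my Unix environment.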

score 11 · Accepted Answer

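Assuming your starting point is Unicode, because you have been careful to decode all incoming data no matter what its native encoding might be, the Unicode::Collate module is an easy place to start.

If you want locale tailoring, then you probably want to start with Unicode::Collate::Locale instead.

Decoding into Unicode

If you run in an all-UTF-8 environment, this is easy. But if you are subject to the vicissitudes of random so-called "locales" (or, even worse, the ugly things that Microsoft calls "code pages"), then you might want to get the CPAN Encode::Locale module to help you. For example: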

 use Encode;
 use Encode::Locale;

 # use "locale" as an arg to encode/decode
 @ARGV = map { decode(locale =>  $_) } @ARGV;

 # or as a stream for binmode or open
 binmode $some_fh, ":encoding(locale)";

 binmode STDIN,  ":encoding(console_in)"  if -t STDIN;
 binmode STDOUT, ":encoding(console_out)"  if -t STDOUT;
 binmode STDERR, ":encoding(console_out)"  if -t STDERR;

(If it were me, I would just use ":utf8" for the output.)
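
When the source encoding is known up front, the decoding step can also be done explicitly with the core Encode module instead of asking the locale. This is a minimal sketch; the sample bytes and the choice of "cp1252" are assumptions made for illustration, not something from the question.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Assumption for this sketch: the bytes came from a file saved in
# Windows code page 1252, where byte 0x80 is the euro sign.
my $bytes = "100\x80";                  # "100" plus the cp1252 euro byte
my $text  = decode("cp1252", $bytes);   # now a Perl character string

printf "%d characters, last one is U+%04X\n",
    length($text), ord(substr($text, -1));
# prints: 4 characters, last one is U+20AC
```

Once the data is a character string like $text, it no longer matters which code page it arrived in.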

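Standard collation, plus locales and tailoring

The point is that once you have everything decoded into Perl's internal format, you can use Unicode::Collate and Unicode::Collate::Locale on it. These can be really easy: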

   use v5.14;
   use utf8;
   use Unicode::Collate;
   my @exes = qw( x⁷ x⁰ x⁸ x³ x⁶ x⁵ x⁴ x² x⁹ x¹ );
   @exes = Unicode::Collate->new->sort(@exes);
   say "@exes";

   # prints: x⁰ x¹ x² x³ x⁴ x⁵ x⁶ x⁷ x⁸ x⁹
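
To see what the collator buys you over Perl's built-in sort, here is a small comparison. It is only a sketch, and the word list is made up for illustration. The built-in sort compares code points, so capitals come before lowercase and accented letters land at the end.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ":encoding(UTF-8)";

# Hypothetical sample data, not from the question.
my @words = qw( zebra Élan apple Zoo eclair );

my @by_codepoint = sort @words;                          # plain cmp on code points
my @by_uca       = Unicode::Collate->new->sort(@words);  # default UCA ordering

print "code point order: @by_codepoint\n";
print "UCA order:        @by_uca\n";
# code point order: Zoo apple eclair zebra Élan
# UCA order:        apple eclair Élan zebra Zoo
```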

Or they can be pretty fancy. Here is one that tries to deal with book titles: it strips leading articles and zero-pads numbers.

 my $collator = Unicode::Collate->new(
     upper_before_lower => 1,
     preprocess => sub {
         local $_ = shift;
         s/^ (?: The | An? ) \h+ //x;            # strip leading articles
         s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers to 20 digits
         return $_;
     },
 );

Now just use that object's sort method to sort with.
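
As a rough illustration of that, the following sketch builds the same kind of title collator and runs a handful of made-up titles through its sort method. The titles are hypothetical sample data. The preprocess callback only affects the sort keys, so the strings that come back are the original titles.

```perl
use strict;
use warnings;
use Unicode::Collate;

my $collator = Unicode::Collate->new(
    upper_before_lower => 1,
    preprocess => sub {
        my $title = shift;
        $title =~ s/^ (?: The | An? ) \h+ //x;            # strip leading articles
        $title =~ s/ ( \d+ ) / sprintf "%020d", $1 /xeg;  # zero-pad numbers
        return $title;
    },
);

# Hypothetical sample data.
my @titles = (
    "The Hobbit",
    "Catch 22",
    "A Tale of Two Cities",
    "Catch 9",
    "An Ideal Husband",
);

my @sorted = $collator->sort(@titles);
print "$_\n" for @sorted;
```

With these inputs the titles should come out as Catch 9, Catch 22, The Hobbit, An Ideal Husband, A Tale of Two Cities: the articles are ignored and 9 sorts before 22 because of the zero padding.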

Sometimes you need to turn the sort inside out. For example:

 my $collator = Unicode::Collate->new();
 for my $rec (@recs) {
     $rec->{NAME_key} = 
        $collator->getSortKey( $rec->{NAME} );
 }
 @srecs = sort {
     $b->{AGE}       <=>  $a->{AGE}
                     ||
     $a->{NAME_key}  cmp  $b->{NAME_key}
 } @recs;

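The reason you have to do this is that you are sorting records with several fields. The binary sort key lets you use the cmp operator on data that has been passed through your chosen or custom collator object.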
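
Here is a self-contained version of that idea with some made-up records, as a sketch only. The field names follow the snippet above; the data itself is hypothetical.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

binmode STDOUT, ":encoding(UTF-8)";

# Hypothetical sample records.
my @recs = (
    { NAME => "Zoë",   AGE => 30 },
    { NAME => "émile", AGE => 42 },
    { NAME => "Eve",   AGE => 42 },
    { NAME => "adam",  AGE => 30 },
);

my $collator = Unicode::Collate->new;

# Compute each binary sort key once, up front.
$_->{NAME_key} = $collator->getSortKey( $_->{NAME} ) for @recs;

# Oldest first; ties broken by collated name via plain cmp on the keys.
my @srecs = sort {
    $b->{AGE}      <=> $a->{AGE}
                   ||
    $a->{NAME_key} cmp $b->{NAME_key}
} @recs;

print "$_->{AGE} $_->{NAME}\n" for @srecs;
```

Precomputing the keys also means the collator runs once per record rather than once per comparison.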

The full constructor for the collator object takes all of these in its formal syntax:

      $Collator = Unicode::Collate->new(
         UCA_Version => $UCA_Version,
         alternate => $alternate, # alias for 'variable'
         backwards => $levelNumber, # or \@levelNumbers
         entry => $element,
         hangul_terminator => $term_primary_weight,
         highestFFFF => $bool,
         identical => $bool,
         ignoreName => qr/$ignoreName/,
         ignoreChar => qr/$ignoreChar/,
         ignore_level2 => $bool,
         katakana_before_hiragana => $bool,
         level => $collationLevel,
         minimalFFFE => $bool,
         normalization  => $normalization_form,
         overrideCJK => \&overrideCJK,
         overrideHangul => \&overrideHangul,
         preprocess => \&preprocess,
         rearrange => \@charList,
         rewrite => \&rewrite,
         suppress => \@charList,
         table => $filename,
         undefName => qr/$undefName/,
         undefChar => qr/$undefChar/,
         upper_before_lower => $bool,
         variable => $variable,
      );

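But you usually do not have to worry about almost any of those. In fact, if you want country-specific locale tailoring using the CLDR data, you should just use Unicode::Collate::Locale, which adds exactly one more parameter to the constructor: locale => $country_code.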

 use Unicode::Collate::Locale;
 $coll = Unicode::Collate::Locale->
           new(locale => "fr");
 @french_text = $coll->sort(@french_text);

See how easy that is?
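
To make the effect of a tailoring visible, this sketch sorts the same made-up word list with the default collator and with the "sv" (Swedish) locale from the table further down. Swedish places ä after z, whereas the default rules treat it as a variant of a.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;
use Unicode::Collate::Locale;

binmode STDOUT, ":encoding(UTF-8)";

# Hypothetical sample data.
my @words = qw( zebra ärlig apple );

my @default = Unicode::Collate->new->sort(@words);
my @swedish = Unicode::Collate::Locale->new(locale => "sv")->sort(@words);

print "default: @default\n";   # default: apple ärlig zebra
print "sv:      @swedish\n";   # sv:      apple zebra ärlig
```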

But you can also do other really cool things.

 use v5.14;                            # enables say
 use utf8;                             # the source contains non-ASCII literals
 use Unicode::Collate::Locale;
 binmode STDOUT, ":encoding(UTF-8)";
 my $Collator = Unicode::Collate::Locale->new(
                  locale => "de__phonebook",
                  level  => 1,
                  normalization => undef,
                );

 my $full = "Ich müß Perl studieren.";
 my $sub = "MUESS";
 if (my ($pos,$len) = $Collator->index($full, $sub)) {
     my $match = substr($full, $pos, $len);
     say "Found match of literal ‹$sub› in ‹$full› as ‹$match›";
 }

When run, that says:

 Found match of literal ‹MUESS› in ‹Ich müß Perl studieren.› as ‹müß›

Here are the locales available as of v0.96 of the Unicode::Collate::Locale module, taken from its manpage:

 locale name       description
--------------------------------------------------------------
 af                Afrikaans
 ar                Arabic
 as                Assamese
 az                Azerbaijani (Azeri)
 be                Belarusian
 bg                Bulgarian
 bn                Bengali
 bs                Bosnian
 bs_Cyrl           Bosnian in Cyrillic (tailored as Serbian)
 ca                Catalan
 cs                Czech
 cy                Welsh
 da                Danish
 de__phonebook     German (umlaut as 'ae', 'oe', 'ue')
 ee                Ewe
 eo                Esperanto
 es                Spanish
 es__traditional   Spanish ('ch' and 'll' as a grapheme)
 et                Estonian
 fa                Persian
 fi                Finnish (v and w are primary equal)
 fi__phonebook     Finnish (v and w as separate characters)
 fil               Filipino
 fo                Faroese
 fr                French
 gu                Gujarati
 ha                Hausa
 haw               Hawaiian
 hi                Hindi
 hr                Croatian
 hu                Hungarian
 hy                Armenian
 ig                Igbo
 is                Icelandic
 ja                Japanese [1]
 kk                Kazakh
 kl                Kalaallisut
 kn                Kannada
 ko                Korean [2]
 kok               Konkani
 ln                Lingala
 lt                Lithuanian
 lv                Latvian
 mk                Macedonian
 ml                Malayalam
 mr                Marathi
 mt                Maltese
 nb                Norwegian Bokmal
 nn                Norwegian Nynorsk
 nso               Northern Sotho
 om                Oromo
 or                Oriya
 pa                Punjabi
 pl                Polish
 ro                Romanian
 ru                Russian
 sa                Sanskrit
 se                Northern Sami
 si                Sinhala
 si__dictionary    Sinhala (U+0DA5 = U+0DA2,0DCA,0DA4)
 sk                Slovak
 sl                Slovenian
 sq                Albanian
 sr                Serbian
 sr_Latn           Serbian in Latin (tailored as Croatian)
 sv                Swedish (v and w are primary equal)
 sv__reformed      Swedish (v and w as separate characters)
 ta                Tamil
 te                Telugu
 th                Thai
 tn                Tswana
 to                Tonga
 tr                Turkish
 uk                Ukrainian
 ur                Urdu
 vi                Vietnamese
 wae               Walser
 wo                Wolof
 yo                Yoruba
 zh                Chinese
 zh__big5han       Chinese (ideographs: big5 order)
 zh__gb2312han     Chinese (ideographs: GB-2312 order)
 zh__pinyin        Chinese (ideographs: pinyin order) [3]
 zh__stroke        Chinese (ideographs: stroke order) [3]
 zh__zhuyin        Chinese (ideographs: zhuyin order) [3]

   Locales according to the default UCA rules include chr (Cherokee), de (German), en (English), ga (Irish), id (Indonesian),
   it (Italian), ka (Georgian), ms (Malay), nl (Dutch), pt (Portuguese), st (Southern Sotho), sw (Swahili), xh (Xhosa), zu
   (Zulu).

   Note

   [1] ja: Ideographs are sorted in JIS X 0208 order.  Fullwidth and halfwidth forms are identical to their regular form.  The
   difference between hiragana and katakana is at the 4th level, the comparison also requires "(variable => 'Non-ignorable')",
   and then "katakana_before_hiragana" has no effect.

   [2] ko: Plenty of ideographs are sorted by their reading. Such an ideograph is primary (level 1) equal to, and secondary
   (level 2) greater than, the corresponding hangul syllable.

   [3] zh__pinyin, zh__stroke and zh__zhuyin: implemented alt='short', where a smaller number of ideographs are tailored.

   Note: 'pinyin' is in latin, 'zhuyin' is in bopomofo.

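So in conclusion, the main trick is to get your local data decoded into a uniform Unicode representation, and then use deterministic sorting, possibly tailored, that does not rely on the random settings of the user's console window for correct behavior.

Note: All these examples, apart from the manpage citation, are lifted from the 4th edition of Programming Perl, by its author's kind permission. :)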

score 1 · Answer

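Win32::OLE::NLS gives you access to that part of the system. It provides CompareString and the tools you need to obtain the necessary locale IDs.

In case you want or need to look up the system documentation, the underlying system call is named CompareStringEx.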
