perl - What is Perl's "standard string comparison order"?

Question

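This is really a two-part question; my two end goals are answers to the following:

In terms of mechanics, what is the standard string comparison order?
What is a better name for it, so that I can update the documentation?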

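Perl's documentation for sort says that, without a block, sort uses "standard string comparison order". But what is that order? It should have a better name. For this question, I specifically mean the case where locale is not in effect, since locale defines its own order.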

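For years we have generally called the standard sort order "ASCIIbetical". The term is in Learning Perl and many other books. However, it is out of date. Perl has supported Unicode since 5.6, and talking about ASCII is old school. Because Perl also supports Unicode, it knows about character strings. In sv.c, Perl_sv_cmp knows about locale, bytes, and UTF-8. The first two are easy, but I am not confident about the third.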

/*
=for apidoc sv_cmp

Compares the strings in two SVs.  Returns -1, 0, or 1 indicating whether the
string in C<sv1> is less than, equal to, or greater than the string in
C<sv2>. Is UTF-8 and 'use bytes' aware, handles get magic, and will
coerce its args to strings if necessary.  See also C<sv_cmp_locale>.

=cut
*/

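When Perl sorts with UTF-8, what is it really sorting on? The bytes of the encoded string, the characters it represents (perhaps including marks?), or something else? I think this is the relevant line in sv.c (line 6698 at commit 7844ec1):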

 pv1 = tpv = (char*)bytes_to_utf8((const U8*)pv1, &cur1);

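If I am reading that right (with my rusty C), pv1 is cast to octets, converted to UTF-8, and then cast to characters (in the C sense). I think that means it sorts by the UTF-8 encoding (that is, the actual bytes UTF-8 uses to represent the code points). Another way to say that is that it does not sort on graphemes. I think I have almost convinced myself that I am reading it right, but some of you know a lot more about this than I do.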

From that, the next interesting line is 6708:

 const I32 retval = memcmp((const void*)pv1, (const void*)pv2, cur1 < cur2 ? cur1 : cur2);

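To me it looks like once it has pv1 and pv2, which were cast to char *, it now just compares them byte by byte because they are cast to void *. Is that what happens with memcmp, which from the various docs I have read so far looks like it just compares bits? Again, I am wondering what I am missing in the trip from bytes->utf8->char->bytes, such as perhaps a Unicode normalization step. Checking out Perl_bytes_to_utf8 in utf8.c did not help me answer that question.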

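As a side note, I am wondering whether this is the same thing as the Unicode Collation Algorithm? If it is, why does Unicode::Collate exist? From the looks of it, I do not think Perl's sort handles canonical equivalence.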
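One way to probe the side note is to put the default sort next to Unicode::Collate on the same input. The following is a minimal sketch, assuming the Unicode::Collate module that ships with Perl and its default collation table; the four sample words are arbitrary.

```perl
use strict;
use warnings;
use Unicode::Collate;

# Arbitrary sample input mixing upper- and lowercase letters.
my @words = qw(b a B A);

# Default sort: no block, so "standard string comparison order".
my @plain = sort @words;

# Unicode Collation Algorithm with the module's default settings.
my @uca = Unicode::Collate->new->sort(@words);

print "plain: @plain\n";   # uppercase letters sort before lowercase ones
print "UCA:   @uca\n";     # letters grouped together, case as a tie-breaker
```

If the two lines of output differ, the default sort is not applying the Unicode Collation Algorithm, which would explain why Unicode::Collate exists as a separate module.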

score 14 · Accepted Answer

UTF-8 has the property that sorting a UTF-8 string byte-by-byte according to the byte value gives the same ordering as sorting it codepoint-by-codepoint according to the codepoint number. That is, I know without looking that the UTF-8 representation of U+2345 is lexicographically after the UTF-8 representation of U+1234.
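That property can be checked directly. The sketch below is only an illustration, assuming the core Encode module and an arbitrary handful of code points that span 1-, 2-, 3-, and 4-byte UTF-8 sequences; it orders the same characters by code point number, by their encoded bytes, and with the default sort.

```perl
use strict;
use warnings;
use Encode qw(encode);

# Arbitrary sample code points covering every UTF-8 sequence length.
my @chars = map { chr($_) } (0x2345, 0x41, 0x1F600, 0xE9, 0x1234);

# Order by code point number.
my @by_codepoint = sort { ord($a) <=> ord($b) } @chars;

# Order by the raw UTF-8 bytes of each character.
my @by_bytes =
    map  { $_->[1] }
    sort { $a->[0] cmp $b->[0] }
    map  { [ encode('UTF-8', $_), $_ ] } @chars;

# Perl's default sort, with no block and no locale in effect.
my @by_default = sort @chars;

# Print code point labels rather than the characters themselves.
my $fmt = sub { join ' ', map { sprintf 'U+%04X', ord($_) } @_ };
print "code point: ", $fmt->(@by_codepoint), "\n";
print "UTF-8 byte: ", $fmt->(@by_bytes),     "\n";
print "default:    ", $fmt->(@by_default),   "\n";
```

All three lines should list the characters in the same order, which is why "code point order" is a reasonable description of what the default sort does when locale is not in effect.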

As for normalization, the Perl core doesn't know anything about it; to get accurate sorting and comparison among the different forms you would want to run all of your strings through Unicode::Normalize and convert them all to the same normalization form. I can't comment on which is best for any given purpose, mostly because I have no clue.
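A small sketch of what that means in practice, assuming the core Unicode::Normalize module and using NFC purely as an example form (not a recommendation): the same visible "é" can be one code point or two, and the core comparison operators treat the two spellings as different strings until both are normalized.

```perl
use strict;
use warnings;
use Unicode::Normalize qw(NFC);

my $composed   = "\x{E9}";     # e-acute as a single code point, U+00E9
my $decomposed = "e\x{301}";   # "e" followed by U+0301 COMBINING ACUTE ACCENT

# Without normalization the two spellings are simply different strings.
my $raw_equal = ($composed eq $decomposed) ? 1 : 0;

# After converting both to the same form they compare equal.
my $nfc_equal = (NFC($composed) eq NFC($decomposed)) ? 1 : 0;

# They also land on opposite sides of "f" when sorted as-is.
my $composed_vs_f   = $composed   cmp "f";         # U+00E9 is above "f"
my $decomposed_vs_f = $decomposed cmp "f";         # starts with "e", below "f"
my $nfc_vs_f        = NFC($decomposed) cmp "f";    # now matches the composed form

print "raw equal: $raw_equal, NFC equal: $nfc_equal\n";
print "vs f: composed $composed_vs_f, decomposed $decomposed_vs_f, ",
      "decomposed after NFC $nfc_vs_f\n";
```

Normalizing every string to one form before sorting keeps canonically equivalent strings next to each other, although the result is still code point order, not a language-aware collation.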

Also, sorting and cmp are affected by the locale pragma if it's in use; it uses the POSIX collation order. Using use locale, an 8-bit locale, and unicode all together is a recipe for disaster, but using use locale, a UTF-8 locale, and unicode should work usefully. I can't say I've tried it. There's an awful lot of info in perllocale and perlunicode anyway.

score 5 · Accepted Answer

I can't answer the whole question, so let me home in on one aspect:

    const I32 retval = memcmp((const void*)pv1, (const void*)pv2, cur1 < cur2 ? cur1 : cur2);

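...it looks like once it has pv1 and pv2, which were cast to char *, it now just compares them byte by byte because they are cast to void *. Is that what happens with memcmp?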

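Pretty much. The main differences between memcmp and strcmp are:

strcmp stops as soon as it sees a NUL byte (i.e., '\0'), and Perl allows scalars to contain embedded NULs
memcmp generally runs faster than strcmp

But other than that, you'll get the same results.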
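The embedded-NUL point is visible from Perl itself. A minimal sketch with two made-up strings that are identical up to and including a NUL byte and differ only after it:

```perl
use strict;
use warnings;

# Two strings that share the prefix "abc" plus a NUL byte.
my $x = "abc" . "\0" . "def";
my $y = "abc" . "\0" . "xyz";

my $len   = length $x;             # the NUL counts as an ordinary character
my $order = $x cmp $y;             # the comparison continues past the NUL
my $equal = ($x eq $y) ? 1 : 0;

print "length: $len, cmp: $order, eq: $equal\n";
```

A comparison that stopped at the first NUL, as strcmp does, would report these two strings as equal; Perl does not, which is consistent with a length-bounded memcmp.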
