postgresql - Unicode Character Default Collation Table

Question

I'm not sure which site this question really belongs on, so I'm posting it here.

I'm using Postgresql 9.2 on RHEL 6.4 and observe the following:

select foo 
from unnest('{а,ә,б,в,г,д,е,ж}'::text[]) as foo 
order by foo collate "kk_KZ.utf8"

gives

а
ә
б
в
г
д
е
ж

but

select foo 
from unnest('{а,ә,б,в,г,д,е,ж}'::text[]) as foo 
order by foo collate "en_US.utf8"

gives

а
б
в
г
д
е
ә -- misplaced
ж

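Further, I found that there is a Default Unicode Collation Element Table [1], which lists the character in question in the right order (04D9 ; [.199D.0020.0002.04D9] # CYRILLIC SMALL LETTER SCHWA).

I know it is silly to expect the "en_US.utf8" locale to handle Cyrillic characters properly, but what is the correct behavior, according to Unicode or any other relevant standard, for collating characters that don't normally belong to the language/locale in use?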
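For anyone who wants to inspect such entries programmatically, here is a minimal sketch of splitting a line like the one quoted above into its code point, its collation weights, and its name. It assumes the older four-weight layout shown in the quote (newer versions of the file may differ), and `parse_allkeys_line` is a hypothetical helper name, not part of any library.

```python
import re

def parse_allkeys_line(line):
    """Split one DUCET-style line into (code points, collation elements, name).

    Assumes the layout quoted in the question:
        04D9 ; [.199D.0020.0002.04D9] # CYRILLIC SMALL LETTER SCHWA
    """
    data, _, comment = line.partition("#")
    cps, _, elems = data.partition(";")
    code_points = [int(cp, 16) for cp in cps.split()]
    # Each element looks like [.W1.W2.W3.W4] or [*W1.W2.W3.W4];
    # the leading '.' or '*' marks non-variable vs. variable weighting.
    elements = [
        tuple(int(w, 16) for w in weights.split("."))
        for _marker, weights in re.findall(r"\[([.*])([0-9A-Fa-f.]+)\]", elems)
    ]
    return code_points, elements, comment.strip()

line = "04D9 ; [.199D.0020.0002.04D9] # CYRILLIC SMALL LETTER SCHWA"
code_points, elements, name = parse_allkeys_line(line)
print(chr(code_points[0]), elements, name)
```

The first weight in each element is the primary weight, which is the one that decides the base order of letters.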

[1] http://www.unicode.org/Public/UCA/latest/allkeys.txt

score 4 · Accepted Answer

It's not misplaced. It might be to you, but it's not to me. :-) In all seriousness, there is no correct behavior by Unicode; there simply cannot be. A character set is a mapping; the collation is a locale-specific set of rules to sort the characters in that set -- and even within the same locale there can be multiple collations.

The ICU docs have colorful examples of how thorny this kind of stuff gets, in case you're curious. Quoting extensively:

http://userguide.icu-project.org/collation

[H]ere are some of the ways languages vary in ordering strings:

The letters A-Z can be sorted in a different order than in English. For example, in Lithuanian, "y" is sorted between "i" and "k".

Combinations of letters can be treated as if they were one letter. For example, in traditional Spanish "ch" is treated as a single letter, and sorted between "c" and "d".

Accented letters can be treated as minor variants of the unaccented letter. For example, "é" can be treated equivalent to "e".

Accented letters can be treated as distinct letters. For example, "Å" in Danish is treated as a separate letter that sorts just after "Z".

Unaccented letters that are considered distinct in one language can be indistinct in another. For example, the letters "v" and "w" are two different letters according to English. However, "v" and "w" are considered variant forms of the same letter in Swedish.

A letter can be treated as if it were two letters. For example, in traditional German "ä" is compared as if it were "ae".

Thai requires that the order of certain letters be reversed.

French requires that letters sorted with accents at the end of the string be sorted ahead of accents in the beginning of the string. For example, the word "côte" sorts before "coté" because the acute accent on the final "e" is more significant than the circumflex on the "o".

Sometimes lowercase letters sort before uppercase letters. The reverse is required in other situations. For example, lowercase letters are usually sorted before uppercase letters in English. Latvian letters are the exact opposite.

Even in the same language, different applications might require different sorting orders. For example, in German dictionaries, "öf" would come before "of". In phone books the situation is the exact opposite.

Sorting orders can change over time due to government regulations or new characters/scripts in Unicode.
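To make one of the quoted rules concrete, here is a rough sketch of the traditional Spanish case, where "ch" counts as a single letter between "c" and "d". The alphabet list and the `sort_key` helper are simplified assumptions for illustration only; real collation data (ICU, CLDR, glibc) is far more detailed.

```python
# Simplified traditional Spanish alphabet: "ch" and "ll" are single letters.
# This list is an illustrative assumption, not real locale data.
TRADITIONAL_SPANISH = [
    "a", "b", "c", "ch", "d", "e", "f", "g", "h", "i", "j", "k", "l", "ll",
    "m", "n", "ñ", "o", "p", "q", "r", "s", "t", "u", "v", "w", "x", "y", "z",
]
RANK = {letter: i for i, letter in enumerate(TRADITIONAL_SPANISH)}

def sort_key(word):
    """Turn a word into a list of letter ranks, treating 'ch'/'ll' as one letter."""
    word = word.lower()
    key = []
    i = 0
    while i < len(word):
        pair = word[i:i + 2]
        if len(pair) == 2 and pair in RANK:
            key.append(RANK[pair])  # two characters, one collation letter
            i += 2
        else:
            # Unknown characters sort after the whole alphabet.
            key.append(RANK.get(word[i], len(RANK)))
            i += 1
    return key

words = ["cita", "chico", "dama", "curva"]
print(sorted(words))                # code point order: chico, cita, curva, dama
print(sorted(words, key=sort_key))  # traditional order: cita, curva, chico, dama
```

The same characters and the same encoding give two different orders depending on which rule set is applied, which is the point of the quote.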

score 2 · Accepted Answer

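The Unicode Collation Algorithm allows any tailoring of the DUCET.

There is no "correct" behavior. One can expect a variety of behaviors, and the most appropriate one depends on context and audience. Sometimes any behavior may be correct, since there is really no reason to prefer one order of Cyrillic letters over another in a US English collation.

The Common Locale Data Repository provides locale-specific tailorings of the DUCET. CLDR uses LDML (Locale Data Markup Language) to specify the tailorings, with the syntax given by Unicode Technical Specification #35, part 5.

The latest version of the data CLDR provides for en_US has no tailorings: it uses a modified version of the DUCET (as described in UTS #35 under "Root Collation"). It lists Cyrillic schwa after Cyrillic A, i.e. the order you expect.

There is also data for an en_US_POSIX locale, which includes some modifications but changes nothing that isn't ASCII.

It seems that the en_US locale installed on your system uses a tailoring that puts schwa next to E, probably because of the similarity of their forms. Arguably, this is less surprising to a US English audience than sorting schwa after A: ask people what that letter is and see how many tell you it's an "upside-down E". There is no right or wrong here, but if you ask me, it seems more appropriate than the collation in CLDR.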
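As a rough illustration of what a tailoring does, the sketch below starts from a base order with schwa after A (the order in the question's kk_KZ output) and applies one rule that moves schwa to just after IE, which reproduces the en_US output from the question. The `tailor` and `collate` helpers are hypothetical names; real tailorings are expressed in LDML and applied to weight tables, not to plain lists.

```python
# Base order for the eight letters in the question, schwa right after A.
ROOT_ORDER = ["а", "ә", "б", "в", "г", "д", "е", "ж"]

def tailor(base_order, letter, after):
    """Return a copy of base_order with `letter` moved to just after `after`."""
    order = [c for c in base_order if c != letter]
    order.insert(order.index(after) + 1, letter)
    return order

def collate(words, order):
    """Sort words by per-letter rank; letters outside `order` sort last."""
    rank = {c: i for i, c in enumerate(order)}
    return sorted(words, key=lambda w: [rank.get(c, len(rank)) for c in w])

# One tailoring rule: schwa sorts immediately after IE.
TAILORED_ORDER = tailor(ROOT_ORDER, "ә", "е")

letters = ["а", "ә", "б", "в", "г", "д", "е", "ж"]
print(collate(letters, ROOT_ORDER))      # а ә б в г д е ж
print(collate(letters, TAILORED_ORDER))  # а б в г д е ә ж
```

Both results are valid collations of the same characters; they differ only in which table was used.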

score 2 · Accepted Answer

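Postgresql uses the locales provided by the operating system. In your setup, the locales are provided by glibc. Glibc uses a heavily modified version of an "ancient" version of ISO 14651 (see glibc Bug 14095 - Review / update collation data from Unicode / ISO 14651 for information on the pains of the current attempt to update the glibc locale data).

As of glibc 2.28, released on 1 August 2018, glibc uses data from ISO 14651:2016 (which is synchronized with Unicode 9) and gives the order the OP expects for en_US.

ISO 14651 is a method for comparing character strings together with a description of the common template tailorable ordering. It is similar to the UCA, with some differences. The CTT (Common Template Table) is the ISO 14651 equivalent of the DUCET, and the two are aligned.

The first time CYRILLIC SMALL LETTER SCHWA appeared in glibc's collation tables was for the az_AZ locale (Azerbaijan), where it sorts after CYRILLIC SMALL LETTER IE. This corresponds to: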
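The dependence on the operating system's locale data can be checked outside Postgresql as well. The sketch below uses Python's standard `locale` module, which calls into the C library's collation functions, so on a glibc system the result depends on the installed glibc version and locales. The function name `sort_with_os_locale` is a hypothetical helper, and the locale name passed in is an assumption about what is installed.

```python
import locale

def sort_with_os_locale(words, locale_name):
    """Sort words using the C library's collation for locale_name.

    Returns None if the locale is not installed on this system.
    """
    previous = locale.setlocale(locale.LC_COLLATE)  # remember current setting
    try:
        locale.setlocale(locale.LC_COLLATE, locale_name)
    except locale.Error:
        return None
    try:
        # strxfrm turns each string into a key that compares like strcoll.
        return sorted(words, key=locale.strxfrm)
    finally:
        locale.setlocale(locale.LC_COLLATE, previous)

letters = ["а", "ә", "б", "в", "г", "д", "е", "ж"]
print(sort_with_os_locale(letters, "en_US.UTF-8"))
```

No expected output is shown on purpose: where schwa lands depends on the C library's locale data on the machine that runs it.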

commit fcababc4e18fee81940dab20f7c40b1e1fb67209
Author: Ulrich Drepper <drepper@redhat.com>
Date:   Fri Aug 3 08:42:28 2001 +0000

    Update.

    2001-08-03  Ulrich Drepper  <drepper@redhat.com>

        * locale/iso-639.def: Add Tigrinya.

From there, that ordering was eventually moved into the file iso14651_t1, in order to simplify the glibc locale data, according to Bug 672 - Include iso14651_t1 in collation rules. This corresponds to:

commit 5d2489928c0040d2a71dd0e63c801f2cf98e7efc
Author: Ulrich Drepper <drepper@redhat.com>
Date:   Sun Feb 18 04:34:28 2007 +0000

    [BZ #672]

    2005-01-16  Denis Barbier  <barbier@linuxfr.org>
        [BZ #672]
        * locales/ca_ES: Replace current collation rules by including
        iso14651_t1 and adding extra rules if needed.  There should be
        no noticeable changes in sorted text. only ligatures and
        ignoreable characters have modified weights.
        * locales/da_DK: Likewise.
        * locales/en_CA: Likewise.
        * locales/es_US: Likewise.
        * locales/fi_FI: Likewise.
        * locales/nb_NO: Likewise.

        [BZ #672]
        * locales/iso14651_t1: Simplified.  Extended.

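Most locales in glibc start from iso14651_t1 and tailor it, which is what you are seeing with en_US.

While glibc based its default ordering on Azerbaijani, the DUCET based its ordering on Kazakh and Tatar, and that is where the difference comes from.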
