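The Unicode Normalization FAQ includes the following paragraph:

Programs should always compare canonical-equivalent Unicode strings as equal... The Unicode Standard provides well-defined normalization forms that can be used for this: NFC and NFD.

and continues...

Choosing which to use depends on the particular program or system. NFC is the best form for general text, since it is more compatible with strings converted from legacy encodings. ... NFD and NFKD are most useful for internal processing.

My questions are:

What makes NFC best for "general text"? What defines "internal processing", and why is it best left to NFD? And finally, regardless of what is "best", are the two forms interchangeable as long as two strings are compared using the same normalization form?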

2 Answers

The FAQ is somewhat misleading, starting from its use of “should” followed by the inconsistent use of “requirement” about the same thing. The Unicode Standard itself (cited in the FAQ) is more accurate. Basically, you should not expect programs to treat canonically equivalent strings as different, but neither should you expect all programs to treat them as identical.

In practice, it really depends on what your software needs to do. In most situations, you don’t need to normalize at all, and normalization may destroy essential information in the data.

For example, U+0387 GREEK ANO TELEIA (·) is defined as canonically equivalent to U+00B7 MIDDLE DOT (·). This was a mistake, as the characters are really distinct and should be rendered differently and treated differently in processing. But it’s too late to change that, since this part of Unicode has been carved into stone. Consequently, if you convert data to NFC, or to NFD, or otherwise discard differences between canonically equivalent strings, you risk getting wrong characters.
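To make the information loss concrete, here is a minimal Python sketch using the standard `unicodedata` module. The helper name `survives_normalization` is made up for this illustration; it only checks whether a string comes back unchanged from a given normalization form.

```python
import unicodedata

ano_teleia = "\u0387"  # GREEK ANO TELEIA
middle_dot = "\u00b7"  # MIDDLE DOT


def survives_normalization(text: str, form: str) -> bool:
    """Return True if `text` is unchanged by the given normalization form."""
    return unicodedata.normalize(form, text) == text


# The Greek punctuation mark is replaced by MIDDLE DOT in every form,
# so the original character cannot be recovered afterwards.
for form in ("NFC", "NFD"):
    normalized = unicodedata.normalize(form, ano_teleia)
    print(form, [f"U+{ord(ch):04X}" for ch in normalized])
```

After normalization there is no way to tell which of the two characters the input originally contained.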

There are also risks that you take by not normalizing. For example, the letter “ä” can appear as a single Unicode character U+00E4 LATIN SMALL LETTER A WITH DIAERESIS or as two Unicode characters U+0061 LATIN SMALL LETTER A U+0308 COMBINING DIAERESIS. It will mostly be the former, i.e. the precomposed form, but if it is the latter and your code tests for data containing “ä” using the precomposed form only, then it will not detect the latter. In many cases, though, you don’t do such things but simply store the data, concatenate strings, print them, etc. Then the only risk is that the two representations result in somewhat different renderings.
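The missed match described above can be reproduced in a few lines of Python. This is only a sketch: `contains` is a hypothetical helper that normalizes both sides before a plain substring test, which is enough for this example but is not a full Unicode-aware search.

```python
import unicodedata

precomposed = "\u00e4"   # ä as one code point
decomposed = "a\u0308"   # a + COMBINING DIAERESIS

text = "B" + decomposed + "r"  # "Bär" stored in decomposed form


def contains(haystack: str, needle: str, form: str = "NFC") -> bool:
    """Substring test after bringing both strings to the same form."""
    return (unicodedata.normalize(form, needle)
            in unicodedata.normalize(form, haystack))


print(precomposed in text)          # naive test on raw code points
print(contains(text, precomposed))  # test after normalizing both sides
```

The naive test compares raw code points and misses the match; the normalized test finds it regardless of which representation the data uses.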

It also matters whether your software passes character data to other software somehow. The recipient might expect, due to naive implicit assumptions or consciously and in a documented manner, that its input is normalized.

Answered 2013-04-13T11:40:11.150

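  1. NFC is the general, common-sense form that you should use. In it, ä is 1 code point, which makes sense.

  2. NFD is useful for certain kinds of internal processing: if you want to do accent-insensitive searching or sorting, having your strings in NFD makes it much easier and faster. Another use is making more robust slug titles. These are just the most obvious ones; I am sure there are plenty more uses.

  3. If two strings x and y are canonically equivalent, then
    toNFC(x) = toNFC(y)
    toNFD(x) = toNFD(y)

    Is that what you meant?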
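Points 2 and 3 can be sketched in Python with the standard `unicodedata` module. Both function names are invented for this illustration, and dropping every character in category `Mn` (nonspacing mark) is an assumption about what "accent-insensitive" should mean; real search or slug code may need language-specific rules.

```python
import unicodedata


def strip_accents(text: str) -> str:
    """Decompose to NFD, then drop nonspacing combining marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed
                   if unicodedata.category(ch) != "Mn")


def canonically_equal(x: str, y: str, form: str = "NFC") -> bool:
    """Compare two strings after normalizing both to the same form."""
    return unicodedata.normalize(form, x) == unicodedata.normalize(form, y)


print(strip_accents("\u00c4rger caf\u00e9"))

# The verdict is the same whichever form is used, as long as both
# sides are normalized to that same form.
for form in ("NFC", "NFD"):
    print(form, canonically_equal("\u00e4", "a\u0308", form))
```

In NFD the base letter and its accent are separate code points, which is why removing accents reduces to a simple filter. For equality testing, either form works as long as it is applied to both strings.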

Answered 2013-04-13T10:44:33.037