I know this may be a stupid question, but I need to be sure about it. So I need to know, for example, whether a programming language that says its String type uses UTF-16 encoding means that:

  1. It will use 2 bytes for code points in the range U+0000 to U+FFFF.
  2. It will use surrogate pairs for code points greater than U+FFFF (4 bytes per code point).

Or do some programming languages use their own "tricks" when encoding and not follow this standard 100%?
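For example, this is the kind of behaviour I mean, checked with a quick Python sketch (the sample characters are arbitrary):

    # U+20AC is inside the BMP, U+1F600 is above U+FFFF
    bmp = "\u20AC"         # EURO SIGN
    astral = "\U0001F600"  # GRINNING FACE

    print(len(bmp.encode("utf-16-le")))     # 2 -> one 16-bit code unit
    print(len(astral.encode("utf-16-le")))  # 4 -> a surrogate pair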

2 Answers

UTF-16 is a specified encoding, so if you "use UTF-16", then you do what it says and don't invent any "tricks" of your own.

I wouldn't talk about "two bytes" the way you do, though. That's a detail. The key part of UTF-16 is that you encode code points as a sequence of 16-bit code units, and pairs of surrogates are used to encode code points greater than 0xFFFF. The fact that one code unit comprises two 8-bit bytes is a second layer of detail that applies to many systems (but there are systems with larger byte sizes where this isn't relevant), and in that case you may distinguish big- and little-endian representations.
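To illustrate (a minimal Python sketch; the code point U+1F600 is just an arbitrary example): the code point is split into the same two 16-bit code units either way, and endianness only affects how each code unit is serialized into bytes.

    cp = 0x1F600
    v = cp - 0x10000
    high = 0xD800 + (v >> 10)    # lead (high) surrogate
    low  = 0xDC00 + (v & 0x3FF)  # trail (low) surrogate
    print(hex(high), hex(low))   # 0xd83d 0xde00

    s = chr(cp)
    print(s.encode("utf-16-be").hex())  # 'd83dde00' -- big-endian byte order
    print(s.encode("utf-16-le").hex())  # '3dd800de' -- little-endian byte order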

But looking in the other direction, there's absolutely no reason why you should use UTF-16 specifically. Ultimately, Unicode text is just a sequence of numbers (of value up to 2²¹), and it's up to you how to represent and serialize those.

I would happily make the case that UTF-16 is a historic accident that we probably wouldn't repeat if we had to redo everything now: it is a variable-length encoding just like UTF-8, so you gain no random access (as opposed to UTF-32), but it is also verbose. It suffers endianness problems, unlike UTF-8. Worst of all, it confuses parts of the Unicode standard with internal representation by using actual code point values for the surrogate pairs.

The only reason (in my opinion) that UTF-16 exists is because at some early point people believed that 16 bits would be enough for all humanity forever, and so UTF-16 was envisaged to be the final solution (like UTF-32 is today). When that turned out not to be true, surrogates and wider ranges were tacked onto UTF-16. Today, you should by and large either use UTF-8 for serialization externally or UTF-32 for efficient access internally. (There may be fringe reasons for preferring, say, UCS-2 for pure Asian text.)
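As a rough illustration of that size trade-off (a Python sketch with arbitrary sample text): CJK characters cost 3 bytes each in UTF-8 but only 2 in a 16-bit encoding, while ASCII characters cost 1 versus 2.

    cjk = "日本語"
    ascii_text = "abc"
    print(len(cjk.encode("utf-8")), len(cjk.encode("utf-16-le")))                # 9 6
    print(len(ascii_text.encode("utf-8")), len(ascii_text.encode("utf-16-le")))  # 3 6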

Answered 2014-12-10T09:11:14.243

UTF-16 itself is standard. However, most languages whose strings are based on 16-bit code units (whether or not they claim to "support" UTF-16) can use any sequence of code units, including invalid surrogates. For example, this is usually an acceptable string literal:

"x \uDC00 y \uD800 z"

Usually you only get an error when you attempt to write it out in another encoding.
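A short Python sketch of that behaviour (assuming CPython 3; the literal is the one above): the lone surrogates are accepted in the string, and the error only appears when you encode it.

    s = "x \uDC00 y \uD800 z"
    print(len(s))   # 9 -- the lone surrogates survive as single code units
    try:
        s.encode("utf-8")
    except UnicodeEncodeError as e:
        print(e)    # "'utf-8' codec can't encode character '\udc00' ..."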

Python's optional surrogateescape encode/decode error handler uses such invalid surrogates to smuggle tokens representing the single bytes 0x80–0xFF into standalone surrogate code units U+DC80–U+DCFF, producing strings like this one. It is typically only used internally, so you are unlikely to come across it in files or on the wire; and it applies to UTF-16 only in that Python's str datatype was based on 16-bit code units (on "narrow" builds between 3.0 and 3.3).
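A minimal round-trip sketch (assuming Python 3): the undecodable byte 0x80 is smuggled into the lone surrogate U+DC80 and restored on encoding.

    raw = b"abc\x80def"                        # not valid UTF-8 because of 0x80
    s = raw.decode("utf-8", "surrogateescape")
    print(ascii(s))                            # 'abc\udc80def'
    print(s.encode("utf-8", "surrogateescape") == raw)  # True -- round-trips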

I am not aware of any other extensions/variations of UTF-16 in common use.

Answered 2014-12-10T12:53:02.643