unicode - UTF-16 reserved codepoints

Question

Why does UTF-16 have a reserved range in the UCS database?

UTF-16 is just a way to represent a character scalar value using one or two unsigned 16-bit code units. The layout of those values shouldn't need to be related to the scalar values themselves, because we have to apply an algorithm to get the actual scalar value from the representation anyway.

Let's assume that the ranges D800-DBFF and DC00-DFFF are not reserved in the UCS database, and that there is a different form of UTF-16 that represents every character in the range 0-7FFF in a single unsigned 16-bit unit. When the high-order bit is set, a second 16-bit unit follows with the remaining bits. For the byte order mark we reserve the two possible values, and that's it.

If I'm wrong, could you explain why?

Thanks

score 7 · Accepted Answer

The scheme you've proposed is less efficient than the current surrogate pair scheme, which is a problem.

Currently, only 0xD800-0xDFFF (2048 code units) are "out of bounds" as normal characters, leaving 63488 code units mapping to single characters. Under your proposal, 0x8000-0xFFFF (32768) code units are reserved for multi-code-unit code points, leaving only the other 32768 code units for single-code-unit code points.

I don't know how many code points are currently specified in the basic multilingual plane, but I wouldn't be surprised if it were more than 32768, and of course it can grow. As soon as it's more than 32768, there would be more characters which require two code units to be represented under your proposal than in UTF-16 as it stands.
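To make the comparison concrete, here is a minimal sketch that counts code units under both schemes. The function names are made up for illustration, and `units_proposed` is only one reading of the scheme in the question (it ignores the values that would be set aside for the byte order mark).

```python
SURROGATE_FIRST, SURROGATE_LAST = 0xD800, 0xDFFF


def units_utf16(cp: int) -> int:
    """Number of 16-bit code units UTF-16 needs for scalar value cp."""
    if SURROGATE_FIRST <= cp <= SURROGATE_LAST:
        raise ValueError("surrogate code points are not scalar values")
    return 1 if cp <= 0xFFFF else 2


def units_proposed(cp: int) -> int:
    """Hypothetical: code units under the question's scheme, where only
    0x0000-0x7FFF fit in a single unit (high bit clear)."""
    return 1 if cp <= 0x7FFF else 2


# How many values can be written as a single code unit in each scheme.
single_utf16 = 0x10000 - (SURROGATE_LAST - SURROGATE_FIRST + 1)  # 63488
single_proposed = 0x8000                                         # 32768
```

Any BMP code point from 0x8000 upwards (outside the surrogate block) takes one unit in UTF-16 but two under the proposal, for example `units_utf16(0x9FFF)` versus `units_proposed(0x9FFF)`.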

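Now I agree that none of this requires that UCS contains a reserved range (and in some ways it's an ugly conflation of meanings) - but doing so makes it possible to map UTF-16 to UCS simply (in code), while still maintaining quite an efficient solution.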
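As a rough illustration of how simple that mapping is, here is a sketch of the standard surrogate pair arithmetic. The function names are my own; the constants are the ones UTF-16 defines.

```python
def to_surrogates(cp: int) -> tuple[int, int]:
    """Split a supplementary code point (U+10000..U+10FFFF) into a
    (high, low) surrogate pair."""
    if not 0x10000 <= cp <= 0x10FFFF:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000                    # 20 bits remain
    return 0xD800 | (v >> 10), 0xDC00 | (v & 0x3FF)


def from_surrogates(high: int, low: int) -> int:
    """Combine a (high, low) surrogate pair back into a code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)


high, low = to_surrogates(0x1F600)      # (0xD83D, 0xDE00)
```

Everything outside the surrogate block maps to itself as a single code unit, so these two functions cover the whole non-trivial part of the encoding.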

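There's little downside to this - there's enough room in UCS that reserving this small block doesn't mean we'll have significantly less room for future expansion.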

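Speculation

This bit is an educated guess. You could do research to find out which characters were assigned in which versions of Unicode, but I believe this is at least a plausible explanation.

The real reason for using that particular block is probably historical - for a long time, Unicode really was just 16-bit, for everything... and characters had already been assigned in the upper range (the part that your scheme deems off-limits). By taking a block of 2048 values which were previously unassigned, all previously valid UCS-2 sequences were preserved as valid UTF-16 sequences with the same meaning, while extending the UCS range beyond the BMP. It's possible that some aspects would have been easier if the range had been 0xF800-0xFFFF, but it was too late by then.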

score 0 · Accepted Answer

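Code points D800-DFFF are reserved because they cannot be represented as themselves in the current UTF-16 encoding scheme. Since they are in the 0000-FFFF range, they would be encoded as-is using one UTF-16 code unit each. If that were allowed, then when a processor is decoding/seeking forwards through a UTF-16 sequence and encounters a code unit in the D800-DBFF range, it would have to decide whether that code unit represents a standalone code point or the start of a surrogate pair. The only way to do that is to look at the next code unit and see whether it is in the DC00-DFFF range. Similarly, when decoding/seeking backwards through a sequence, if a code unit in the DC00-DFFF range is encountered, it would have to look at the previous code unit to see whether it is in the D800-DBFF range. That makes decoding/seeking much harder and more prone to errors.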
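For contrast, here is a small sketch of what the reservation buys in the scheme as it stands. It assumes well-formed UTF-16 given as a list of integer code units; `role` and `code_point_start` are hypothetical helper names.

```python
def role(unit: int) -> str:
    """Classify a UTF-16 code unit from its value alone."""
    if 0xD800 <= unit <= 0xDBFF:
        return "lead"      # first half of a surrogate pair
    if 0xDC00 <= unit <= 0xDFFF:
        return "trail"     # second half of a surrogate pair
    return "single"        # a complete BMP code point


def code_point_start(units: list[int], i: int) -> int:
    """Index at which the code point containing units[i] begins.
    Assumes well-formed input, so a trail unit always follows its lead."""
    return i - 1 if role(units[i]) == "trail" else i


units = [0x0041, 0xD83D, 0xDE00, 0x0042]   # 'A', U+1F600, 'B'
starts = [code_point_start(units, i) for i in range(len(units))]
```

Because no character is ever assigned to D800-DFFF, `role` never has to guess: a unit in that block can only be half of a pair, whichever direction you are scanning in.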

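Un-reserving code points D800-DFFF for actual character use would require logic changes to the UTF-16 encoding scheme, so that these particular code points are escaped in a different manner that does not cause ambiguities. However, under the current encoding scheme, such a change is not possible, AFAIK. So they stay permanently reserved.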
