winapi - How does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane?

Question

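I've looked at many other posts here and elsewhere (see below), but I still don't have a clear answer to this question: how does Windows wchar_t handle Unicode characters outside the Basic Multilingual Plane (BMP)?

That is:

Many programmers seem to feel that UTF-16 is harmful because it is a variable-length encoding.
wchar_t is 16 bits wide on Windows, but 32 bits wide on Unix/MacOS.
The Windows APIs use wide characters, not Unicode.

So what does Windows do when you want to encode a character such as the Han character U+2008A on Windows?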

score 17 · Accepted Answer

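The implementation of wchar_t in the Windows standard library is UTF-16-oblivious: it knows only about 16-bit code units.

So you can put a UTF-16 surrogate sequence in a string, and you can choose to treat it as a single character using higher-level processing. The string implementation won't help you, and it won't hinder you either; it will let you include any sequence of code units in your string, even ones that would be invalid when interpreted as UTF-16.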
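To make "higher-level processing" concrete, here is a minimal sketch that counts characters rather than code units. It uses char16_t and std::u16string as stand-ins for 16-bit wchar_t and std::wstring so that it behaves the same on platforms where wchar_t is 32 bits; the function names are hypothetical, not part of any Windows API.

```cpp
#include <cstddef>
#include <string>

// True if `u` is a high (leading) surrogate, U+D800..U+DBFF.
constexpr bool is_high_surrogate(char16_t u) { return u >= 0xD800 && u <= 0xDBFF; }

// True if `u` is a low (trailing) surrogate, U+DC00..U+DFFF.
constexpr bool is_low_surrogate(char16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

// Counts code points in a UTF-16 string, treating each well-formed
// surrogate pair as one character. An unpaired surrogate is counted as
// one character on its own; this sketch does not reject invalid input.
std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < s.size(); ++i) {
        if (is_high_surrogate(s[i]) && i + 1 < s.size() && is_low_surrogate(s[i + 1])) {
            ++i;  // skip the trailing half of the pair
        }
        ++count;
    }
    return count;
}
```

For the string u"a\U0002008Ab", size() reports 4 code units while count_code_points reports 3 characters, which is exactly the gap between what the string class knows and what the text means.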

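Many of the higher-level features of Windows do support characters made of UTF-16 surrogates, which is why you can give a file a name consisting of a single non-BMP character plus .txt and see it both render correctly and edit correctly (taking a single keypress, not two, to move past the character) in programs like Explorer that support complex text layout (typically using Windows's Uniscribe library).

But there are still places where you can see the UTF-16-obliviousness shining through. For example, you can create a file named with a non-BMP letter plus .txt in the same folder as a file named with the other-case form of that letter, which case-insensitivity would otherwise disallow, or you can create [U+DC01][U+D801].txt programmatically.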
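The [U+DC01][U+D801] name above is a low surrogate followed by a high surrogate, which is not valid UTF-16. A well-formedness check along these lines would catch it; this is a sketch with a hypothetical function name, again using char16_t in place of 16-bit wchar_t, and it says nothing about what any Windows API actually checks.

```cpp
#include <cstddef>
#include <string>

// Returns true if `s` is well-formed UTF-16: every high surrogate
// (U+D800..U+DBFF) is immediately followed by a low surrogate
// (U+DC00..U+DFFF), and no low surrogate appears on its own.
bool is_well_formed_utf16(const std::u16string& s) {
    for (std::size_t i = 0; i < s.size(); ++i) {
        const char16_t u = s[i];
        if (u >= 0xD800 && u <= 0xDBFF) {
            if (i + 1 == s.size()) return false;  // high surrogate at end of string
            const char16_t next = s[i + 1];
            if (next < 0xDC00 || next > 0xDFFF) return false;  // not followed by a low surrogate
            ++i;  // consume the low surrogate
        } else if (u >= 0xDC00 && u <= 0xDFFF) {
            return false;  // low surrogate with no preceding high surrogate
        }
    }
    return true;
}
```

A string holding 0xDC01 then 0xD801 fails this check, yet a std::u16string (or a std::wstring on Windows) stores it without complaint.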

This is how pedants can have a nice long and basically meaningless argument about whether Windows "supports" UTF-16 strings or only UCS-2.

score 9 · Accepted Answer

Windows used to use UCS-2 but adopted UTF-16 with Windows 2000. Windows wchar_t APIs now produce and consume UTF-16.
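As an illustration of what "UTF-16" means for the U+2008A character from the question, the following sketch encodes a code point into one or two 16-bit code units. The function name is hypothetical and char16_t stands in for 16-bit wchar_t; the sketch assumes its input is a valid code point and does no validation.

```cpp
#include <string>

// Encodes one Unicode code point as UTF-16 code units.
// Assumes `cp` is at most 0x10FFFF and is not itself a surrogate;
// this sketch does not validate its input.
std::u16string encode_utf16(char32_t cp) {
    std::u16string out;
    if (cp < 0x10000) {
        // BMP character: a single code unit holds it directly.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Supplementary character: split the 20-bit offset into two halves.
        const char32_t v = cp - 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

Calling encode_utf16(0x2008A) yields the two code units 0xD840 0xDC8A, so on Windows that one character occupies two wchar_t elements.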

Not all third-party programs handle this correctly, so they may be buggy with data outside the BMP.

Also, note that UTF-16, being a variable-length encoding, does not conform to the C or C++ requirements for an encoding used with wchar_t. This causes some problems: standard functions that take a single wchar_t, such as wctomb, can't handle characters beyond the BMP on Windows, and Windows defines some additional functions that use a wider type in order to be able to handle single characters outside the BMP. I forget which function it was, but I ran into a Windows function that returned int instead of wchar_t (and it wasn't one where EOF was a possible result).
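To show why a wider type is needed for a single character outside the BMP, here is a small sketch that combines a surrogate pair into one 32-bit value. It is not the Windows function mentioned above; the name is hypothetical, and char32_t is simply an assumed choice of wider type.

```cpp
// Combines a surrogate pair into the single code point it encodes.
// Assumes `high` is in U+D800..U+DBFF and `low` is in U+DC00..U+DFFF;
// this sketch does not validate its arguments.
constexpr char32_t combine_surrogates(char16_t high, char16_t low) {
    return 0x10000
         + ((static_cast<char32_t>(high) - 0xD800) << 10)
         + (static_cast<char32_t>(low) - 0xDC00);
}
```

The result for the pair 0xD840, 0xDC8A is 0x2008A, which does not fit in 16 bits, so a function that accepts or returns one 16-bit wchar_t cannot carry it.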
