c++ - How do I enter a 4-byte UTF-8 character?

Question

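I am writing a small application that I need to test with UTF-8 characters of different byte lengths.

I can enter Unicode characters that encode to 1, 2 and 3 bytes in UTF-8 just fine, for example: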

string in = "pi = \u03a0";

But how do I get a Unicode character that is encoded with 4 bytes? I have tried:

string in = "aegan check mark = \u10102";

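As far as I understand, that should output the Aegean check mark (U+10102). But when I print it out, I get ᴶ0

What am I missing?

EDIT:

I got it working by adding leading zeros: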

string in = "\U00010102";

Wish I had thought of that sooner :)

score 6 · Accepted Answer

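There is a longer form of the escape: \U followed by eight hex digits, rather than \u followed by four. Because \u consumes exactly four hex digits, "\u10102" is read as U+1010 followed by the literal character 2. Python, among other languages, uses the same \U form: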

>>> '\xf0\x90\x84\x82'.decode("UTF-8")
u'\U00010102'

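However, if you are working with byte strings anyway, why not escape each byte, as in the Python example above, rather than relying on the compiler to convert the escape to a UTF-8 string? That would seem to be more portable as well. If I compile the following program: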

#include <iostream>
#include <string>

int main()
{
    std::cout << "narrow: " << std::string("\uFF0E").length() <<
        " utf8: " << std::string("\xEF\xBC\x8E").length() <<
        " wide: " << std::wstring(L"\uFF0E").length() << std::endl;

    std::cout << "narrow: " << std::string("\U00010102").length() <<
        " utf8: " << std::string("\xF0\x90\x84\x82").length() <<
        " wide: " << std::wstring(L"\U00010102").length() << std::endl;
}

On win32 with my current options, cl gives:

warning C4566: character represented by universal-character-name '\UD800DD02' cannot be represented in the current code page (932)

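The compiler tries to convert all Unicode escapes in byte strings to the system code page, which, unlike UTF-8, cannot represent every Unicode character. Oddly, it has understood that \U00010102 is \uD800\uDD02 in UTF-16 (its internal Unicode representation) and mangled the escape in the error message...

Running the length-comparison program prints: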

narrow: 2 utf8: 3 wide: 1
narrow: 2 utf8: 4 wide: 2

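Note that the UTF-8 byte strings and the wide strings are correct, but the compiler failed to convert "\U00010102", producing the byte string "??", which is an incorrect result.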
