c++ - How do I iterate over Unicode characters in C++?

Question

I know that to get a Unicode character in C++ I can do this:

std::wstring str = L"\u4FF0";

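But what if I want to get all the characters in the range 4FF0 to 5FF0? Is it possible to build a Unicode character dynamically? I have something like this pseudocode in mind: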

for (int i = 20464; i < 24560; i++) { // From 4FF0 to 5FF0
    std::wstring str = L"\u" + hexa(i); // build the unicode character
    // do something with str
}

How would I do that in C++?

score 9 · Accepted Answer

The wchar_t type held in a wstring is an integer type, so you can use it directly:

for (wchar_t c = 0x4ff0;  c <= 0x5ff0;  ++c) {
    std::wstring str(1, c);
    // do something with str
}

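Be careful when trying to use characters above 0xffff, because on some platforms (e.g. Windows) wchar_t is only 16 bits wide and they won't fit in a single wchar_t.

For example, if you want the Emoticons block in a string, you can create the surrogate pairs yourself: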

std::wstring str;
for (int c = 0x1f600; c <= 0x1f64f; ++c) {
    if (c <= 0xffff || sizeof(wchar_t) > 2)
        str.append(1, (wchar_t)c);
    else {
        str.append(1, (wchar_t)(0xd800 | ((c - 0x10000) >> 10)));
        str.append(1, (wchar_t)(0xdc00 | ((c - 0x10000) & 0x3ff)));
    }
}
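
The surrogate-pair arithmetic in the loop above can also be pulled out into a small function. The following is only a sketch: `to_utf16` is a hypothetical helper name, not part of the standard library, and it uses `char16_t` and `std::u16string` instead of `wchar_t` so the result does not depend on the platform. It assumes the input is a valid code point (at most 0x10FFFF and not itself a surrogate).

```cpp
#include <string>

// Hypothetical helper (not a standard function): encode one Unicode
// code point as UTF-16 code units. Assumes cp is a valid code point.
std::u16string to_utf16(char32_t cp) {
    std::u16string out;
    if (cp <= 0xFFFF) {
        // Fits in a single 16-bit code unit.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Split the 20 bits above 0x10000 into a high and a low surrogate.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 | (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 | (cp & 0x3FF)));
    }
    return out;
}
```

For instance, `to_utf16(0x1F600)` yields the two code units 0xD83D and 0xDE00, while `to_utf16(0x4FF0)` yields a single unit.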

score 4 · Accepted Answer

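You can't increment over Unicode characters as if they were an array. Some characters are built up out of multiple 'char's (UTF-8) or multiple 'WCHAR's (UTF-16), and because of diacritics and the like, a single visible character can consist of several code points. If you're really serious about this stuff, you should use an API like UniScribe or ICU.

Some resources to read: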
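
To illustrate the point about diacritics, here is a minimal sketch showing that the same visible character "é" can be stored either as one precomposed code point or as a base letter followed by a combining mark. The names `precomposed_e_acute`, `decomposed_e_acute` and `is_combining_diacritic` are made up for this example, and the range check covers only the Combining Diacritical Marks block (U+0300 to U+036F), not every combining character in Unicode.

```cpp
#include <string>

// "é" as a single precomposed code point, U+00E9.
std::u32string precomposed_e_acute() { return U"\u00E9"; }

// "é" as 'e' followed by U+0301 COMBINING ACUTE ACCENT.
std::u32string decomposed_e_acute() { return U"e\u0301"; }

// Hypothetical helper: true only for the Combining Diacritical Marks
// block. Other blocks also contain combining characters; a library
// such as ICU is needed for a complete answer.
bool is_combining_diacritic(char32_t cp) {
    return cp >= 0x0300 && cp <= 0x036F;
}
```

The two strings render the same but have lengths 1 and 2 and do not compare equal, which is why walking code points one by one is not the same as walking user-visible characters.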

http://en.wikipedia.org/wiki/UTF-16/UCS-2

http://en.wikipedia.org/wiki/Precomposed_character

http://en.wikipedia.org/wiki/Combining_character

http://scripts.sil.org/cms/scripts/page.php?item_id=UnicodeNames#4d2aa980

http://en.wikipedia.org/wiki/Unicode_equivalence

http://msdn.microsoft.com/en-us/library/dd374126.aspx

score 2 · Accepted Answer

What about:

for (std::wstring::value_type i(0x4ff0); i <= 0x5ff0; ++i)
{
    std::wstring str(1, i);
}

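Note that the code has not been tested, so it may not compile as is.

Also, depending on the platform you are working on, a wstring's character unit may be 2, 4, or N bytes wide, so be careful how you use it.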
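
One way to sidestep the varying width of wchar_t is to iterate with `char32_t`, which is wide enough for every code point on any platform. The sketch below rests on that assumption; `code_point_range` is a hypothetical helper name, and skipping the surrogate range and clamping to 0x10FFFF are choices made for this example rather than something the answers above prescribe.

```cpp
#include <string>

// Hypothetical helper: collect the code points in [first, last], both
// ends inclusive. Values above U+10FFFF are clamped away, and the
// surrogate range U+D800..U+DFFF is skipped because those values are
// not characters on their own.
std::u32string code_point_range(char32_t first, char32_t last) {
    std::u32string out;
    if (last > 0x10FFFF) last = 0x10FFFF;
    for (char32_t c = first; c <= last; ++c) {
        if (c >= 0xD800 && c <= 0xDFFF) continue;
        out.push_back(c);
    }
    return out;
}
```

Calling `code_point_range(0x4FF0, 0x5FF0)` gives 4097 code points. Converting them to a wstring or to UTF-8 for output is a separate step that still depends on the platform.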
