c++ - Size of wchar_t* for a surrogate pair (Unicode character outside the BMP) on Windows

Question

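I ran into an interesting issue on Windows 8. I tested whether I can represent Unicode characters outside the BMP with wchar_t* strings. The following test code produced unexpected results for me: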

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

int i2 = sizeof(s1); // i2 == 4, because of the terminating '\0' (I guess).
int i3 = sizeof(s2); // i3 == 4, why?

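U+2008A is a Han character that lies outside the Basic Multilingual Plane, so in UTF-16 it should be represented by a surrogate pair. This means, if I understand it correctly, that it should be represented by two wchar_t characters. So I expected sizeof(s2) to be 6 (4 for the two wchar_t-s of the surrogate pair and 2 for the terminating \0).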

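So why is sizeof(s2) == 4? I verified that the s2 string is constructed correctly, because I rendered it with DirectWrite and the Han character was displayed correctly.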

UPDATE: As Naveen pointed out, I was determining the size of the arrays incorrectly. The following code yields the correct result:

const wchar_t* s1 = L"a";
const wchar_t* s2 = L"\U0002008A"; // The "Han" character

int i1 = sizeof(wchar_t); // i1 == 2, the size of wchar_t on Windows.

std::wstring str1 (s1);
std::wstring str2 (s2);

int i2 = str1.size(); // i2 == 1.
int i3 = str2.size(); // i3 == 2, because two wchar_t characters needed for the surrogate pair.
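
For reference, the two wchar_t code units reported for str2 are the UTF-16 surrogate pair of U+2008A. A minimal sketch of the arithmetic follows. SurrogatePair and encode_surrogate_pair are hypothetical names used only for illustration, and the function assumes its input lies in the range U+10000 to U+10FFFF.

```cpp
#include <cassert>

// Hypothetical illustration type: the two UTF-16 code units of one code point.
struct SurrogatePair {
    char16_t high; // lead surrogate, 0xD800..0xDBFF
    char16_t low;  // trail surrogate, 0xDC00..0xDFFF
};

// Hypothetical helper, not a standard library function.
// Assumes 0x10000 <= cp <= 0x10FFFF (a code point outside the BMP).
inline SurrogatePair encode_surrogate_pair(char32_t cp) {
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    const char32_t v = cp - 0x10000; // 20-bit offset into the supplementary planes
    SurrogatePair p;
    p.high = static_cast<char16_t>(0xD800 + (v >> 10));   // top 10 bits
    p.low  = static_cast<char16_t>(0xDC00 + (v & 0x3FF)); // bottom 10 bits
    return p;
}
```

For U+2008A this gives the code units 0xD840 and 0xDC8A, which is why a 2-byte wchar_t string needs two elements for the character.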

score 9 · Accepted Answer

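sizeof(s2) returns the number of bytes required to store the pointer s2 (or any other pointer), which is 4 bytes on your system. It has nothing to do with the character(s) that s2 points to.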
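
A small sketch of the difference between a pointer and an array may help here. The variable names are illustrative only, and the byte counts in the comments assume a 2-byte wchar_t as on Windows.

```cpp
#include <cstddef>

const wchar_t* s2_ptr   = L"\U0002008A"; // pointer to the literal, as in the question
const wchar_t  s2_arr[] = L"\U0002008A"; // array that holds a copy of the characters

// Size of the pointer object itself: 4 in a 32-bit build, 8 in a 64-bit build.
const std::size_t pointer_bytes = sizeof(s2_ptr);

// Size of the whole array: every code unit plus the terminating L'\0'.
// With a 2-byte wchar_t this is (2 + 1) * 2 == 6, the value the question expected.
const std::size_t array_bytes = sizeof(s2_arr);

// Number of wchar_t elements in the array, terminator included.
const std::size_t array_units = sizeof(s2_arr) / sizeof(s2_arr[0]);
```

sizeof only sees the full string when it is applied to an array. Once the literal has decayed to a pointer, the length has to be found at run time.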

score 4 · Accepted Answer

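sizeof(wchar_t*) is the same as sizeof(void*), i.e. the size of the pointer itself. On a 32-bit system that is always 4, and on a 64-bit system always 8. You need to use wcslen() or lstrlenW() instead of sizeof():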

const wchar_t* s1 = L"a"; 
const wchar_t* s2 = L"\U0002008A"; // The "Han" character 

int i1 = sizeof(wchar_t); // i1 == 2
int i2 = wcslen(s1); // i2 == 1
int i3 = wcslen(s2); // i3 == 2
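
Note that wcslen() counts code units, not characters, which is why i3 is 2 for a single Han character. If a count of code points is wanted, one possible sketch is shown here. count_code_points is a hypothetical helper, it assumes well-formed UTF-16, and it uses std::u16string so that the behavior does not depend on the platform's wchar_t size.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in well-formed UTF-16 text.
// Every code point contributes exactly one unit that is not a trail
// surrogate, so trail surrogates (0xDC00..0xDFFF) are skipped.
inline std::size_t count_code_points(const std::u16string& s) {
    std::size_t count = 0;
    for (char16_t unit : s) {
        const bool is_trail_surrogate = (unit >= 0xDC00 && unit <= 0xDFFF);
        if (!is_trail_surrogate) {
            ++count;
        }
    }
    return count;
}
```

For u"\U0002008A" the string's size() is 2, while count_code_points returns 1.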

score 0 · Accepted Answer

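Addendum to the answers.
RE: untangling the different units used for i1 and for i2, i3 in the question's update.

i1 value of 2 is a size in bytes.
i2 value of 1 is a size in wchar_t units, IOW 2 bytes (given that sizeof(wchar_t) is 2).
i3 value of 2 is a size in wchar_t units, IOW 4 bytes.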
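
The conversion between the two units can be sketched as follows. byte_length and byte_length_with_terminator are hypothetical helper names, not library functions.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: bytes occupied by the characters of s,
// not counting the terminating L'\0'.
inline std::size_t byte_length(const std::wstring& s) {
    return s.size() * sizeof(wchar_t);
}

// Hypothetical helper: same as above, plus one wchar_t for the terminator.
inline std::size_t byte_length_with_terminator(const std::wstring& s) {
    return (s.size() + 1) * sizeof(wchar_t);
}
```

With a 2-byte wchar_t, byte_length(L"\U0002008A") is 2 * 2 == 4 and byte_length_with_terminator gives 6, the figure the question originally expected from sizeof(s2).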
