windows - Difference between MBCS and UTF-8 on Windows

Question

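I'm reading about character sets and encodings on Windows. I noticed that the Visual Studio compiler (for C++) has two compiler flags called MBCS and UNICODE. What is the difference between them? What I don't get is how UTF-8 is conceptually different from an MBCS encoding. Also, I found the following quote in MSDN:

"Unicode is a 16-bit character encoding"

This contradicts everything I've read about Unicode. I thought Unicode could be encoded with different encodings such as UTF-8 and UTF-16. Can somebody shed some more light on this confusion?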

score 113 · Accepted Answer

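"I noticed that the Visual Studio compiler (for C++) has two compiler flags called MBCS and UNICODE. What is the difference between them?"

Many functions in the Windows API come in two versions: one that takes char parameters (in a locale-specific code page) and one that takes wchar_t parameters (in UTF-16).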

int MessageBoxA(HWND hWnd, const char* lpText, const char* lpCaption, unsigned int uType);
int MessageBoxW(HWND hWnd, const wchar_t* lpText, const wchar_t* lpCaption, unsigned int uType);

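Each of these function pairs also has a macro without the suffix, which resolves to one version or the other depending on whether the UNICODE macro is defined.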

#ifdef UNICODE
   #define MessageBox MessageBoxW
#else
   #define MessageBox MessageBoxA
#endif

To make this work, the TCHAR type is defined to abstract away the character type used by the API functions.

#ifdef UNICODE
    typedef wchar_t TCHAR;
#else
    typedef char TCHAR;
#endif

However, this is a bad idea. You should always specify the character type explicitly.
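The mechanism can be sketched portably, without windows.h. Every name here (DEMO_UNICODE, DEMO_TCHAR, DEMO_TEXT, DemoLengthA, DemoLengthW, DemoLength) is a hypothetical stand-in and not part of the Windows API. The sketch only illustrates that the unsuffixed name changes meaning with a build flag, while the suffixed names never do.

```cpp
#include <cstddef>
#include <cstring>
#include <cwchar>

// Hypothetical stand-ins for an "A"/"W" function pair; not real Windows API.
inline std::size_t DemoLengthA(const char* s)    { return std::strlen(s); }
inline std::size_t DemoLengthW(const wchar_t* s) { return std::wcslen(s); }

// Same pattern as the header excerpts above, keyed on a made-up macro.
#ifdef DEMO_UNICODE
    typedef wchar_t DEMO_TCHAR;
    #define DemoLength DemoLengthW
    #define DEMO_TEXT(s) L##s
#else
    typedef char DEMO_TCHAR;
    #define DemoLength DemoLengthA
    #define DEMO_TEXT(s) s
#endif

// Generic-text style: which function runs depends on a build flag.
inline std::size_t GenericCall() {
    const DEMO_TCHAR* text = DEMO_TEXT("hello");
    return DemoLength(text);
}

// Explicit style: the character type is visible at the call site.
inline std::size_t ExplicitCall() {
    return DemoLengthW(L"hello");
}
```

Both calls return the same length here, but only ExplicitCall keeps its meaning if someone later flips the build flag.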

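"What I don't get is how UTF-8 is conceptually different from an MBCS encoding."

MBCS stands for "multi-byte character set". Taken literally, UTF-8 would seem to qualify.

But on Windows, "MBCS" refers only to the character encodings that can be used with the "A" versions of the Windows API functions. This includes code pages 932 (Shift_JIS), 936 (GBK), 949 (KS_C_5601-1987), and 950 (Big5), but not UTF-8.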
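To make "multi-byte" concrete, here is a minimal sketch of how text in one of these code pages is walked. It assumes the commonly documented lead-byte ranges for code page 932 (0x81-0x9F and 0xE0-0xFC); the function names are made up for illustration, and trail bytes are not validated.

```cpp
#include <cstddef>
#include <string>

// Assumption: in code page 932, bytes 0x81-0x9F and 0xE0-0xFC start a
// two-byte character; every other byte is a complete character on its own.
inline bool IsCp932LeadByte(unsigned char b) {
    return (b >= 0x81 && b <= 0x9F) || (b >= 0xE0 && b <= 0xFC);
}

// Count characters (not bytes) in a CP932-encoded byte string.
// Sketch only: does not check that a valid trail byte follows a lead byte.
inline std::size_t CountCp932Chars(const std::string& bytes) {
    std::size_t count = 0;
    for (std::size_t i = 0; i < bytes.size(); ++count) {
        const unsigned char b = static_cast<unsigned char>(bytes[i]);
        i += IsCp932LeadByte(b) ? 2 : 1;  // skip one or two bytes per character
    }
    return count;
}
```

For example, the three bytes 0x41 0x82 0xA0 ("A" followed by hiragana "あ") count as two characters.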

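To use UTF-8, you have to convert the string to UTF-16 with MultiByteToWideChar, call the "W" version of the function, and then call WideCharToMultiByte on the output. This is essentially what the "A" functions do internally, which makes me wonder why Windows doesn't just support UTF-8.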
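A minimal sketch of that round trip follows. The helper names (Utf8ToWide, WideToUtf8, MessageBoxUtf8) are hypothetical; only MultiByteToWideChar, WideCharToMultiByte, and MessageBoxW are real API calls. Error handling is reduced to returning an empty string, and the whole block is guarded so that it compiles to nothing outside Windows.

```cpp
#ifdef _WIN32  // this sketch only applies on Windows
#include <windows.h>
#include <cstddef>
#include <string>

// UTF-8 -> UTF-16. Returns an empty string on invalid input.
inline std::wstring Utf8ToWide(const std::string& utf8) {
    if (utf8.empty()) return std::wstring();
    const int inLen = static_cast<int>(utf8.size());
    // First call: query the required length in wchar_t units.
    const int outLen = MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                                           utf8.data(), inLen, nullptr, 0);
    if (outLen <= 0) return std::wstring();
    std::wstring wide(static_cast<std::size_t>(outLen), L'\0');
    // Second call: convert into the buffer.
    MultiByteToWideChar(CP_UTF8, MB_ERR_INVALID_CHARS,
                        utf8.data(), inLen, &wide[0], outLen);
    return wide;
}

// UTF-16 -> UTF-8, for strings that come back from a "W" function.
inline std::string WideToUtf8(const std::wstring& wide) {
    if (wide.empty()) return std::string();
    const int inLen = static_cast<int>(wide.size());
    const int outLen = WideCharToMultiByte(CP_UTF8, 0, wide.data(), inLen,
                                           nullptr, 0, nullptr, nullptr);
    if (outLen <= 0) return std::string();
    std::string utf8(static_cast<std::size_t>(outLen), '\0');
    WideCharToMultiByte(CP_UTF8, 0, wide.data(), inLen,
                        &utf8[0], outLen, nullptr, nullptr);
    return utf8;
}

// Show a message box whose text and caption are UTF-8 encoded.
inline int MessageBoxUtf8(HWND hWnd, const std::string& text,
                          const std::string& caption, UINT type) {
    return MessageBoxW(hWnd, Utf8ToWide(text).c_str(),
                       Utf8ToWide(caption).c_str(), type);
}
#endif  // _WIN32
```

Calling each conversion function twice (once to size the buffer, once to fill it) avoids guessing a buffer size up front.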

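Because they cannot support the most common character encoding, the "A" versions of the Windows API functions are useless. Therefore, you should always use the "W" functions.

"Unicode is a 16-bit character encoding"

"This contradicts everything I've read about Unicode."

MSDN is wrong. Unicode is a 21-bit coded character set that has several encodings, the most common being UTF-8, UTF-16, and UTF-32. (There are other Unicode encodings as well, such as GB18030, UTF-7, and UTF-EBCDIC.)

Whenever Microsoft refers to "Unicode", they really mean UTF-16 (or UCS-2). This is for historical reasons. Windows NT was an early adopter of Unicode, back when 16 bits was thought to be enough for everyone and UTF-8 was used only on Plan 9. So at the time, UCS-2 was Unicode.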
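The gap between UCS-2 and UTF-16 is easiest to see in code. The following is a small sketch (the function name EncodeUtf16 is made up for illustration) of how UTF-16 represents one code point: values below U+10000 take a single 16-bit unit, which is all UCS-2 can express, and everything above is split into a surrogate pair.

```cpp
#include <cstdint>
#include <vector>

// Encode one Unicode code point as UTF-16 code units.
// Returns an empty vector for values that cannot be encoded.
inline std::vector<std::uint16_t> EncodeUtf16(std::uint32_t cp) {
    std::vector<std::uint16_t> out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        return out;  // out of range, or a lone surrogate value
    }
    if (cp < 0x10000) {
        // Fits in one 16-bit unit; this is the only case UCS-2 covers.
        out.push_back(static_cast<std::uint16_t>(cp));
    } else {
        const std::uint32_t v = cp - 0x10000;  // 20 bits remain
        out.push_back(static_cast<std::uint16_t>(0xD800 + (v >> 10)));    // high surrogate
        out.push_back(static_cast<std::uint16_t>(0xDC00 + (v & 0x3FF)));  // low surrogate
    }
    return out;
}
```

For instance, U+0041 stays a single unit 0x0041, while U+1F600 becomes the pair 0xD83D 0xDE00.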

score 18 · Accepted Answer

Answered 2012-10-22T12:14:45.080

score 12 · Accepted Answer

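MBCS stands for Multi-Byte Character Set. It describes any character set that encodes a character in (possibly) more than 1 byte.

The ANSI / ASCII character sets are not multi-byte.

UTF-8, however, is a multi-byte encoding. It encodes any Unicode character as a sequence of 1, 2, 3, or 4 octets (bytes).

However, UTF-8 is only one of several possible concrete encodings of the Unicode character set. Notably, UTF-16 is another, and it happens to be the encoding used by Windows / .NET (IIRC). Here is the difference between UTF-8 and UTF-16:

UTF-8 encodes any Unicode character as a sequence of 1, 2, 3, or 4 bytes.
UTF-16 encodes most Unicode characters in 2 bytes, and some in 4 bytes.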
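The two size rules can be checked with a short sketch. The function names (EncodeUtf8, Utf16ByteCount) are made up for illustration, and invalid input is simply mapped to an empty string or zero.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Encode one Unicode code point as UTF-8 (1 to 4 bytes).
// Returns an empty string for values that cannot be encoded.
inline std::string EncodeUtf8(std::uint32_t cp) {
    std::string out;
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return out;
    if (cp < 0x80) {                      // 7 bits  -> 1 byte
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {              // 11 bits -> 2 bytes
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {            // 16 bits -> 3 bytes
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {                              // 21 bits -> 4 bytes
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}

// Number of bytes the same code point occupies in UTF-16 (0 if invalid).
inline std::size_t Utf16ByteCount(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) return 0;
    return cp < 0x10000 ? 2 : 4;
}
```

As a quick comparison, U+0041 takes 1 byte in UTF-8 and 2 in UTF-16, U+20AC takes 3 and 2, and U+1F600 takes 4 in both.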

Therefore, it is incorrect to say that Unicode is a 16-bit character encoding. It is more like a 21-bit encoding, since its code points run from U+0000 up to U+10FFFF.
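The "21-bit" figure is plain arithmetic on the largest code point and can be checked at compile time. This is only an illustrative sketch; BitsNeeded is a made-up helper name.

```cpp
#include <cstdint>

// Number of bits needed to represent value (0 is treated as needing 0 bits).
constexpr int BitsNeeded(std::uint32_t value) {
    return value == 0 ? 0 : 1 + BitsNeeded(value >> 1);
}

// U+FFFF is the largest value a single 16-bit unit can hold.
static_assert(BitsNeeded(0xFFFF) == 16, "U+FFFF fits in 16 bits");
// U+10FFFF is the largest Unicode code point.
static_assert(BitsNeeded(0x10FFFF) == 21, "U+10FFFF needs 21 bits");
```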

score 5 · Accepted Answer

As a footnote to the other answers, MSDN has a document, Generic-Text Mappings in TCHAR.H, with handy tables summarizing how the preprocessor symbols _UNICODE and _MBCS change the definitions of various C/C++ types.

As for the phrases "Unicode" and "Multi-Byte Character Set", others have already described what their effects are. I just want to emphasize that both are Microsoft-speak for some very specific things. (That is, they mean something narrower and more Windows-specific than one might expect when coming from a general understanding of text internationalization.) Those exact phrases tend to get their own sections or subsections in Microsoft technical documents, e.g. in Text and Strings in Visual C++.
