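I find the C standards (C99 and C11) vague about code points and encoding rules for characters and strings:
First, the standard defines the source character set and the execution character set. Essentially, it provides a set of glyphs but does not associate any numeric values with them. So what is the default character set?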
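I am not asking about encodings here, only about the mapping from glyphs (the repertoire) to numbers (code points). The standard does define universal character names in terms of ISO/IEC 10646, but does it say that this is the default character set?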
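To make the question concrete, here is a minimal sketch of the only numeric guarantees I could find for the basic character set: the null character has the value 0, and the ten decimal digits have consecutive values. Everything else printed by the second function appears to be up to the implementation. The function names are my own, purely for illustration.

```c
#include <assert.h>
#include <stdio.h>

/* The null character is the one member whose value is spelled out. */
static_assert('\0' == 0, "the null character has the value 0");

/* Returns 1 if '0'..'9' have consecutive values, which the standard
   requires of both the source and the execution character set. */
int digits_are_contiguous(void)
{
    const char digits[] = "0123456789";
    for (int i = 0; i < 10; i++) {
        if (digits[i] != '0' + i)
            return 0;
    }
    return 1;
}

/* Prints the values this implementation assigns to a few basic characters.
   These numbers are not fixed by the standard: 65 for 'A' suggests an
   ASCII-compatible set, 193 would suggest EBCDIC. */
void print_basic_values(void)
{
    printf("'A' = %d, 'a' = %d, '0' = %d, ' ' = %d\n", 'A', 'a', '0', ' ');
}
```

Calling print_basic_values() from main on two different platforms is a quick way to see whether they agree on the mapping.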
As an extension of the above: I could not find anything that says which characters the numeric escape sequences \0 and \x represent.
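My current reading, which may be wrong, is that an octal or hexadecimal escape specifies the numeric value of the character directly and does not name a glyph at all. The sketch below shows what I mean; the function names are hypothetical.

```c
#include <assert.h>

/* The escape gives the value itself, whatever the execution character set. */
static_assert('\x41' == 0x41, "hexadecimal escape yields the stated value");
static_assert('\101' == 0101, "octal escape yields the stated value");

int hex_escape_value(void)   { return '\x41'; }
int octal_escape_value(void) { return '\101'; }

/* Whether the value 0x41 is also the letter A depends on the implementation:
   this returns 1 on ASCII-compatible execution character sets and would
   return 0 on an EBCDIC one. */
int x41_is_letter_A(void)
{
    return '\x41' == 'A';
}
```

If that reading is right, then '\x41' is portable as a number but not as a character, which is exactly the part the standard seems to leave open.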
From the C standards (C99 and C11; I have not checked ANSI C) I gathered the following about character and string literals:
+---------+-----+------------+----------------------------------------------+
| Literal | Std | Type       | Meaning                                      |
+---------+-----+------------+----------------------------------------------+
| '...'   | C99 | int        | An integer character constant is a sequence  |
|         |     |            | of one or more multibyte characters          |
| L'...'  | C99 | wchar_t    | A wide character constant is a sequence of   |
|         |     |            | one or more multibyte characters             |
| u'...'  | C11 | char16_t   | A wide character constant is a sequence of   |
|         |     |            | one or more multibyte characters             |
| U'...'  | C11 | char32_t   | A wide character constant is a sequence of   |
|         |     |            | one or more multibyte characters             |
| "..."   | C99 | char[]     | A character string literal is a sequence of  |
|         |     |            | zero or more multibyte characters            |
| L"..."  | C99 | wchar_t[]  | A wide string literal is a sequence of zero  |
|         |     |            | or more multibyte characters                 |
| u"..."  | C11 | char16_t[] | A wide string literal is a sequence of zero  |
|         |     |            | or more multibyte characters                 |
| U"..."  | C11 | char32_t[] | A wide string literal is a sequence of zero  |
|         |     |            | or more multibyte characters                 |
| u8"..." | C11 | char[]     | A UTF-8 string literal is a sequence of zero |
|         |     |            | or more multibyte characters                 |
+---------+-----+------------+----------------------------------------------+
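The Type column can at least be sanity-checked at compile time. This is a minimal sketch assuming a C11 compiler that ships <uchar.h>; it only compares sizes, so it confirms the element widths and not the encodings. The helper wide_literal_bytes is a hypothetical name of mine.

```c
#include <assert.h>
#include <stddef.h>   /* wchar_t, size_t */
#include <uchar.h>    /* char16_t, char32_t (C11) */

/* Character constants: the size matches the type listed in the table. */
static_assert(sizeof 'a'  == sizeof(int),      "'...' has type int");
static_assert(sizeof L'a' == sizeof(wchar_t),  "L'...' has type wchar_t");
static_assert(sizeof u'a' == sizeof(char16_t), "u'...' has type char16_t");
static_assert(sizeof U'a' == sizeof(char32_t), "U'...' has type char32_t");

/* String literals: arrays of the element type, terminator included. */
static_assert(sizeof "ab"   == 3 * sizeof(char),     "char[3]");
static_assert(sizeof L"ab"  == 3 * sizeof(wchar_t),  "wchar_t[3]");
static_assert(sizeof u"ab"  == 3 * sizeof(char16_t), "char16_t[3]");
static_assert(sizeof U"ab"  == 3 * sizeof(char32_t), "char32_t[3]");
static_assert(sizeof u8"ab" == 3,                    "three one-byte elements");

/* Size in bytes of a two-character wide string literal on this platform. */
size_t wide_literal_bytes(void)
{
    return sizeof L"ab";
}
```

Note that sizeof 'a' equalling sizeof(int) is specific to C; the same expression gives 1 in C++.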
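However, I could not find anything about the encoding rules for these literals. The name "UTF-8 string literal" does hint at UTF-8 encoding, but I don't think it is stated explicitly anywhere. Also, for the other types, is the encoding undefined or implementation-defined?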
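Two things I could check empirically are sketched here, under the assumption of a C11 compiler. First, the bytes of a u8 literal for U+00E9, written as a universal character name so the encoding of the source file does not matter. Second, the optional C11 macros __STDC_UTF_16__, __STDC_UTF_32__ and __STDC_ISO_10646__, which an implementation may define to describe char16_t, char32_t and wchar_t. The function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* unsigned char keeps the initialisation valid whether the element type of
   a u8 literal is char (C11) or char8_t (C23). */
static const unsigned char e_acute_utf8[] = u8"\u00e9";

static_assert(sizeof e_acute_utf8 == 3,
              "two UTF-8 code units plus the terminator");

size_t e_acute_utf8_size(void)               { return sizeof e_acute_utf8; }
const unsigned char *e_acute_utf8_bytes(void) { return e_acute_utf8; }

/* Returns 1 if the implementation declares char16_t values to be UTF-16. */
int char16_is_utf16(void)
{
#ifdef __STDC_UTF_16__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if the implementation declares char32_t values to be UTF-32. */
int char32_is_utf32(void)
{
#ifdef __STDC_UTF_32__
    return 1;
#else
    return 0;
#endif
}

/* Returns 1 if wchar_t values are declared to be ISO/IEC 10646 code points. */
int wchar_is_iso10646(void)
{
#ifdef __STDC_ISO_10646__
    return 1;
#else
    return 0;
#endif
}
```

The fact that these macros are optional is part of what makes me suspect the encodings of the other literal types are implementation-defined, but I would like confirmation.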
I am not familiar with the UNIX specification. Does it impose any additional constraints on these rules?
Also, it would help if someone could tell me which character sets and encoding schemes GCC and MSVC use.
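In the meantime, a small probe like the following could be used to inspect what a given compiler actually stores for an ordinary string literal. It is only a sketch, and hex_dump is a hypothetical helper of mine, not a library function.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Writes the bytes of s into out as space-separated lowercase hex.
   Returns the number of bytes dumped, or 0 if out is too small
   (3 bytes of output are needed per input byte). */
size_t hex_dump(const char *s, char *out, size_t out_size)
{
    assert(s != NULL && out != NULL);

    size_t n = strlen(s);
    if (out_size == 0 || out_size < 3 * n)
        return 0;

    out[0] = '\0';
    for (size_t i = 0; i < n; i++) {
        /* Two hex digits plus a terminator at offset 3*i. */
        sprintf(out + 3 * i, "%02x", (unsigned)(unsigned char)s[i]);
        if (i + 1 < n)
            out[3 * i + 2] = ' ';   /* separator, overwritten terminator */
    }
    return n;
}
```

Passing it "\u00e9" should show c3 a9 if the execution character set is UTF-8 and e9 if it is Latin-1. As far as I understand, GCC lets you change this with -fexec-charset (and the source side with -finput-charset), while MSVC has /execution-charset, /source-charset and /utf-8, but I would like to know what the defaults are.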