c++ - UTF-8 Unicode encoding and country-specific characters

Question

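I'm a bit lost when it comes to encodings. I don't understand why we say "UTF-8 Unicode". To me, "Unicode" sounds like every possible character in the world, which can't fit into single-byte characters.

Could you explain this to me?

Second question: if I decide to use single-byte characters with "UTF-8 Unicode" encoding in my program, will I be able to handle most European characters? What about Russian, Arabic, Chinese, and so on?

Thanks for your help.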

score 3 · Accepted Answer

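UTF-8 uses a single byte only for some characters: the basic Western letters, digits, and punctuation (the ASCII range). Other characters take more than one byte.

A simple English string such as "Hello world!" takes one byte per character. Include an accented character, as in "Café", and that character takes more than one byte (the "é" takes two).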
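To make the difference between bytes and characters concrete, here is a minimal C++ sketch. The helper names `byte_count` and `code_point_count` are made up for this illustration, and the code assumes its input is already valid UTF-8 (it does no validation).

```cpp
#include <cstddef>
#include <string>

// Number of bytes in a UTF-8 string: exactly what std::string::size() reports.
std::size_t byte_count(const std::string& utf8) {
    return utf8.size();
}

// Number of code points: count every byte that is NOT a continuation
// byte (continuation bytes have the bit pattern 10xxxxxx).
// Assumption: the input is valid UTF-8; nothing is checked here.
std::size_t code_point_count(const std::string& utf8) {
    std::size_t n = 0;
    for (unsigned char c : utf8) {
        if ((c & 0xC0) != 0x80) {
            ++n;
        }
    }
    return n;
}
```

With these helpers, "Hello world!" is 12 bytes and 12 code points, while "Café" written as the bytes `"Caf\xC3\xA9"` is 5 bytes but only 4 code points.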

The "Description" section of the relevant Wikipedia article describes it well.

score 3 · Accepted Answer

In the following, I use the term "character" to denote something that can be displayed on a screen and printed on paper by a computer. The Unicode term for the number assigned to each such character is "code point". The letter 'a' is code point 97 (0x61), and 'ྦྷ' is code point 4007 (0xFA7).

Unicode as such encodes just about every known character in every language known on this planet. The coding starts with traditional English/American characters and control characters in the first 128 code points (0..127). The next 128 cover a bunch of European letters such as accented and umlauted characters (é, Ä, ö) and some special characters (£, ©, etc.; note that € sits higher up, at 0x20AC). Then higher numbers cover "less European" languages such as Russian, Japanese, Chinese, Thai, Urdu, Arabic, Hebrew, etc. [I'm not sure exactly in which order these are].

The numbers go up to 0x10FFFF (1,114,111), so just over a million.

You can look at the different characters for example here.

UTF-8 uses 8 bits per "token". The first 128 characters are encoded directly as the single bytes 0..127. Everything else is a multi-byte sequence whose first byte starts with 11 in binary. That first byte tells you how many further bytes follow, by using more and more 1s at the beginning, and each subsequent byte is encoded as 10xxxxxx. Under the current standard (RFC 3629) there are at most 3 further bytes; the original design allowed up to 5. There is ALWAYS a 0 between the last "this is special" 1 bit and the actual data. So, for example, a 2-byte sequence looks like 110xxxxx 10yyyyyy, where xxxxxyyyyyy is the binary code of the character.
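The bit layout described above can be sketched in C++ roughly as follows. `encode_utf8` is a name invented for this example, not a standard library function, and the sketch follows the current 4-byte limit of RFC 3629 rather than the older 6-byte scheme.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode one Unicode code point as 1 to 4 UTF-8 bytes.
// Hypothetical helper for illustration only.
std::string encode_utf8(std::uint32_t cp) {
    // Reject values past the end of Unicode and the UTF-16 surrogate range.
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a valid Unicode scalar value");
    }
    std::string out;
    if (cp < 0x80) {
        // 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 110xxxxx 10yyyyyy
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 1110xxxx 10yyyyyy 10zzzzzz
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 11110xxx 10yyyyyy 10zzzzzz 10wwwwww
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

For instance, `encode_utf8(0x61)` yields the single byte 0x61 ('a'), `encode_utf8(0xE9)` yields C3 A9 ('é'), and `encode_utf8(0xFA7)` yields E0 BE A7.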

UTF-16 works on a similar principle, except that each "token" is 16 bits. Characters up to 0xFFFF are stored as a single 16-bit unit. In UTF-16, the range 0xD800-0xDFFF is reserved to encode the characters that need more than 16 bits, as a pair of such units. You can read more in the Wikipedia article here (I've not worked much with UTF-16).
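As a rough illustration of how the 0xD800-0xDFFF range is used, here is a small C++11 sketch. `encode_utf16` is a hypothetical helper written for this answer; it splits code points above 0xFFFF into a high unit (0xD800-0xDBFF) and a low unit (0xDC00-0xDFFF), each carrying 10 bits.

```cpp
#include <cstdint>
#include <stdexcept>
#include <string>

// Encode one Unicode code point as one or two UTF-16 code units.
// Hypothetical helper for illustration only.
std::u16string encode_utf16(std::uint32_t cp) {
    if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF)) {
        throw std::invalid_argument("not a valid Unicode scalar value");
    }
    std::u16string out;
    if (cp < 0x10000) {
        // Fits in a single 16-bit unit.
        out += static_cast<char16_t>(cp);
    } else {
        // Subtract 0x10000 to get a 20-bit value, then split it 10 + 10.
        std::uint32_t v = cp - 0x10000;
        out += static_cast<char16_t>(0xD800 | (v >> 10));    // high surrogate
        out += static_cast<char16_t>(0xDC00 | (v & 0x3FF));  // low surrogate
    }
    return out;
}
```

For example, `encode_utf16(0x732B)` gives the single unit 0x732B, while `encode_utf16(0x1F600)` gives the pair 0xD83D 0xDE00.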

score 2 · Accepted Answer

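According to Wikipedia,

Unicode is a computing industry standard for the consistent encoding, representation, and handling of text expressed in most of the world's writing systems.

UTF-8 is part of Unicode and describes an encoding. It can encode all of the roughly 1.1 million code points in the Unicode standard. The "8" is there because each character is encoded using a multiple of 8 bits.

For example, "A" is encoded in hexadecimal as "41", "é" as "C3 A9", and "猫" as "E7 8C AB".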
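If you want to check byte sequences like these yourself, a small hex-dump helper is enough. The sketch below is in C++; `to_hex` is a name made up for this example, and the bytes are passed in explicitly so the result does not depend on how your compiler treats non-ASCII source text.

```cpp
#include <string>

// Render the bytes of a string as upper-case hex pairs separated by spaces.
// Hypothetical helper for illustration only.
std::string to_hex(const std::string& bytes) {
    static const char digits[] = "0123456789ABCDEF";
    std::string out;
    for (unsigned char c : bytes) {
        if (!out.empty()) {
            out += ' ';
        }
        out += digits[c >> 4];    // high nibble
        out += digits[c & 0x0F];  // low nibble
    }
    return out;
}
```

Here `to_hex("A")` gives "41", `to_hex("\xC3\xA9")` gives "C3 A9", and `to_hex("\xE7\x8C\xAB")` gives "E7 8C AB". Passing a literal such as "猫" directly should give the same result only if both the source file and the compiler's execution character set are UTF-8, which is an assumption about your build setup.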
