629

I have gotten conflicting opinions from people - and according to the Wikipedia UTF-8 page.

They are the same thing, aren't they? Can someone clarify?


18 Answers

580

To expand on the answers others have given:

We've got lots of languages with lots of characters that computers should ideally display. Unicode assigns each character a unique number, or code point.

Computers deal with numbers such as bytes... skipping a bit of history here and ignoring memory addressing issues, 8-bit computers would treat an 8-bit byte as the largest numerical unit easily represented on the hardware, 16-bit computers would expand that to two bytes, and so forth.

Old character encodings such as ASCII are from the (pre-) 8-bit era, and try to cram the dominant language in computing at the time, i.e. English, into numbers ranging from 0 to 127 (7 bits). With 26 letters in the alphabet, both in capital and non-capital form, numbers and punctuation signs, that worked pretty well. ASCII got extended by an 8th bit for other, non-English languages, but the additional 128 numbers/code points made available by this extension would be mapped to different characters depending on the language being displayed. The ISO-8859 standards are the most common forms of this mapping; ISO-8859-1 and ISO-8859-15 (also known as ISO-Latin-1, latin1, and yes there are two different versions of the 8859 ISO standard as well).

But that's not enough when you want to represent characters from more than one language, so cramming all available characters into a single byte just won't work.

There are essentially two different types of encodings: one expands the value range by adding more bits. Examples of these encodings would be UCS2 (2 bytes = 16 bits) and UCS4 (4 bytes = 32 bits). They suffer from inherently the same problem as the ASCII and ISO-8859 standards, as their value range is still limited, even if the limit is vastly higher.

The other type of encoding uses a variable number of bytes per character, and the most commonly known encodings for this are the UTF encodings. All UTF encodings work in roughly the same manner: you choose a unit size, which for UTF-8 is 8 bits, for UTF-16 is 16 bits, and for UTF-32 is 32 bits. The standard then defines some of these bits as flags: if they are set, the next unit in a sequence of units is to be considered part of the same character. If they are not set, this unit represents one character fully. Thus the most common (English) characters only occupy one byte in UTF-8 (two in UTF-16, four in UTF-32), but characters from other languages can occupy several units (UTF-8 as originally designed allowed up to six bytes per character, though it is limited to four today).

Multi-byte encodings (I should say multi-unit, after the above explanation) have the advantage that they are relatively space-efficient, but the downside that operations such as finding substrings, comparisons, etc. all have to decode the characters to Unicode code points before such operations can be performed (there are some shortcuts, though).

Both the UCS standards and the UTF standards encode the code points as defined in Unicode. In theory, those encodings could be used to encode any number (within the range the encoding supports) - but of course these encodings were made to encode Unicode code points. And that's the relationship between them.

Windows handles so-called "Unicode" strings as UTF-16 strings, while most UNIXes default to UTF-8 these days. Communication protocols such as HTTP tend to work best with UTF-8, as the unit size in UTF-8 is the same as in ASCII, and most such protocols were designed in the ASCII era. On the other hand, UTF-16 gives the best average space/processing performance when representing all living languages.

The Unicode standard defines fewer code points than can be represented in 32 bits. Thus for all practical purposes, UTF-32 and UCS4 became the same encoding, as you're unlikely to have to deal with multi-unit characters in UTF-32.

Hope that fills in some details.
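
To make the unit sizes concrete, here is a quick Python check (an illustrative sketch using Python's built-in codecs; the -le suffixed names are used only to pin a byte order so no BOM is prepended):

for ch in "Aé汉😀":
    print(f"U+{ord(ch):04X}",
          len(ch.encode("utf-8")),      # 1, 2, 3, 4 bytes
          len(ch.encode("utf-16-le")),  # 2, 2, 2, 4 bytes
          len(ch.encode("utf-32-le")))  # always 4 bytes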

Answered 2009-03-13T17:37:20.903
346

Let me use an example to illustrate this topic:

A Chinese character:      汉
its Unicode value:        U+6C49
convert 6C49 to binary:   01101100 01001001

Nothing magical so far, it's very simple. Now, let's say we decide to store this character on our hard drive. To do that, we need to store the character in binary format. We can simply store it as is '01101100 01001001'. Done!

But wait a minute, is '01101100 01001001' one character or two characters? You knew this is one character because I told you, but when a computer reads it, it has no idea. So we need some sort of encoding to tell the computer to treat it as one.

This is where the rules of UTF-8 come in: https://www.fileformat.info/info/unicode/utf8.htm

Binary format of bytes in sequence

1st Byte    2nd Byte    3rd Byte    4th Byte    Number of Free Bits   Maximum Expressible Unicode Value
0xxxxxxx                                                7             007F hex (127)
110xxxxx    10xxxxxx                                (5+6)=11          07FF hex (2047)
1110xxxx    10xxxxxx    10xxxxxx                  (4+6+6)=16          FFFF hex (65535)
11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    (3+6+6+6)=21          10FFFF hex (1,114,111)

According to the table above, if we want to store this character using the UTF-8 format, we need to prefix our character with some 'headers'. Our Chinese character is 16 bits long (count the binary value yourself), so we will use the format on row 3 as it provides enough space:

Header  Place holder    Fill in our Binary   Result         
1110    xxxx            0110                 11100110
10      xxxxxx          110001               10110001
10      xxxxxx          001001               10001001

Writing out the result in one line:

11100110 10110001 10001001

This is the UTF-8 binary value of the Chinese character! See for yourself: https://www.fileformat.info/info/unicode/char/6c49/index.htm

Summary

A Chinese character:      汉
its Unicode value:        U+6C49
convert 6C49 to binary:   01101100 01001001
encode 6C49 as UTF-8:     11100110 10110001 10001001
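
To double-check the derivation, here is a rough Python sketch of the table's logic (illustration only; it skips validation such as rejecting surrogate code points), compared against Python's built-in codec:

def utf8_encode(code_point):
    # Hand-rolled UTF-8 encoder following the table above (illustration only).
    if code_point <= 0x7F:                       # 0xxxxxxx
        return bytes([code_point])
    elif code_point <= 0x7FF:                    # 110xxxxx 10xxxxxx
        return bytes([0b11000000 | (code_point >> 6),
                      0b10000000 | (code_point & 0b111111)])
    elif code_point <= 0xFFFF:                   # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0b11100000 | (code_point >> 12),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])
    else:                                        # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        return bytes([0b11110000 | (code_point >> 18),
                      0b10000000 | ((code_point >> 12) & 0b111111),
                      0b10000000 | ((code_point >> 6) & 0b111111),
                      0b10000000 | (code_point & 0b111111)])

print(" ".join(format(b, "08b") for b in utf8_encode(0x6C49)))
# 11100110 10110001 10001001
print(utf8_encode(0x6C49) == "汉".encode("utf-8"))   # True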

P.S. If you want to learn this topic in Python, click here.

Answered 2015-01-14T09:07:10.350
227

Unfortunately, "Unicode" is used in various different ways, depending on the context. Its most correct use (IMO) is as a coded character set - i.e. a set of characters and a mapping between the characters and integer code points representing them.

UTF-8 is a character encoding - a way of converting from sequences of bytes to sequences of characters and vice versa. It covers the whole of the Unicode character set. ASCII is encoded as a single byte per character, and other characters take more bytes depending on their exact code point (up to 4 bytes for all currently defined code points, i.e. up to U-0010FFFF, and indeed 4 bytes could cope with up to U-001FFFFF).

When "Unicode" is used as the name of a character encoding (e.g. as the .NET Encoding.Unicode property), it usually means UTF-16, which encodes the most common characters as two bytes. Some platforms (notably .NET and Java) use UTF-16 as their "native" character encoding. This leads to hairy problems if you need to worry about characters which can't be encoded in a single UTF-16 value (they're encoded as "surrogate pairs") - but most developers never worry about this, IME.
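
To see what a surrogate pair actually looks like, here is a small Python sketch (in Python 3, str works in code points, but the UTF-16 codec shows the pair; the -le name just fixes the byte order so no BOM is written):

# An emoji outside the Basic Multilingual Plane (code point above U+FFFF)
ch = "\U0001F600"                     # 😀, U+1F600

print(len(ch))                        # 1 code point
print(ch.encode("utf-16-le").hex())   # 3dd800de -> surrogate pair D83D DE00
print(ch.encode("utf-8").hex())       # f09f9880 -> 4 bytes in UTF-8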


Answered 2009-03-13T17:11:10.000
119

They're not the same thing - UTF-8 is a particular way of encoding Unicode.

There are lots of different encodings you can choose from depending on your application and the data you intend to use. The most common are UTF-8, UTF-16 and UTF-32, as far as I know.

Answered 2009-03-13T17:09:23.670
92

Unicode only defines code points, that is, a number which represents a character. How you store these code points in memory depends on the encoding that you are using. UTF-8 is one way of encoding Unicode characters, among many others.

Answered 2009-03-13T17:14:36.747
38

Unicode is a standard which, together with ISO/IEC 10646, defines the Universal Character Set (UCS), a superset of all existing characters required to represent practically all known languages.

Unicode assigns a name and a number (character code, or code point) to each character in its repertoire.

UTF-8 encoding is a way to represent these characters digitally in computer memory. UTF-8 maps each code point into a sequence of octets (8-bit bytes).

For example,

UCS character = the Unicode Han character 𤭢

UCS code point = U+24B62

UTF-8 encoding = F0 A4 AD A2 (hex) = 11110000 10100100 10101101 10100010 (bin)
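
A quick way to double-check that example is to round-trip it in Python (chr-style escapes and str.encode are standard built-ins):

ch = "\U00024B62"                     # code point U+24B62
encoded = ch.encode("utf-8")

print(encoded.hex())                  # f0a4ada2
print(" ".join(format(b, "08b") for b in encoded))
# 11110000 10100100 10101101 10100010
print(bytes.fromhex("F0A4ADA2").decode("utf-8") == ch)   # True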

Answered 2013-02-24T18:36:01.737
25

Unicode is just a standard that defines a character set (UCS) and encodings (UTF) to encode this character set. But in general, Unicode is referred to as the character set and not the standard.

Read The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) and Unicode In 5 Minutes.

Answered 2009-03-13T17:37:07.193
21

UTF-8 is one possible encoding scheme for Unicode text.

Unicode is a broad-scoped standard which defines over 140,000 characters and allocates each a numerical code (a code point). It also defines rules for how to sort this text, normalise it, change its case, and more. A character in Unicode is represented by a code point from zero up to 0x10FFFF inclusive, though some code points are reserved and cannot be used for characters.

There is more than one way that a string of Unicode code points can be encoded into a binary stream. These are called "encodings". The most straightforward encoding is UTF-32, which simply stores each code point as a 32-bit integer, with each being 4 bytes wide. Since code points only go up to 0x10FFFF (requiring 21 bits), this encoding is somewhat wasteful.

UTF-8 is another encoding, and is becoming the de-facto standard, due to a number of advantages over UTF-32 and others. UTF-8 encodes each code point as a sequence of either 1, 2, 3 or 4 byte values. Code points in the ASCII range are encoded as a single byte value, to be compatible with ASCII. Code points outside this range use either 2, 3, or 4 bytes each, depending on what range they are in.

UTF-8 has been designed with these properties in mind:

  • ASCII characters are encoded exactly as they are in ASCII, such that an ASCII string is also a valid UTF-8 string representing the same characters.

  • More efficient: Text strings in UTF-8 almost always occupy less space than the same strings in either UTF-32 or UTF-16, with just a few exceptions.

  • Binary sorting: Sorting UTF-8 strings using a binary sort will still result in all code points being sorted in numerical order.

  • When a code point uses multiple bytes, none of those bytes contain values in the ASCII range, ensuring that no part of them could be mistaken for an ASCII character. This is also a security feature.

  • UTF-8 can be easily validated, and distinguished from other character encodings by a validator. Text in other 8-bit or multi-byte encodings will very rarely also validate as UTF-8 due to the very specific structure of UTF-8.

  • Random access: At any point in a UTF-8 string it is possible to tell if the byte at that position is the first byte of a character or not, and to find the start of the next or current character, without needing to scan forwards or backwards more than 3 bytes or to know how far into the string we started reading from.
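
To illustrate the random-access point above: continuation bytes always have the form 10xxxxxx, so finding the start of a character is just a matter of skipping at most three such bytes. A small Python sketch (the helper name char_start is only for illustration):

def char_start(data, i):
    # Back up to the first byte of the UTF-8 character containing index i.
    # Continuation bytes always look like 10xxxxxx (0x80-0xBF), so this
    # loop never runs more than 3 times for valid UTF-8.
    while data[i] & 0b11000000 == 0b10000000:
        i -= 1
    return i

data = "a汉b".encode("utf-8")     # bytes: 61 E6 B1 89 62
print(char_start(data, 2))        # 1 -> the 汉 character starts at index 1
print(char_start(data, 4))        # 4 -> 'b' is already a start byte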

Answered 2017-09-26T05:05:13.883
18

1. Unicode

There are lots of characters around the world, like "$, &, h, a, t, ?, 张, 1, =, +...".

Then along comes an organization dedicated to these characters.

They made a standard called "Unicode".

The standard is as follows:

  • It creates a form in which each position is called a "code point", or "code position".
  • The positions run from U+0000 to U+10FFFF;
  • Up until now, some positions are filled with characters, and other positions are reserved or empty.
  • For example, the position "U+0024" is filled with the character "$".

PS: Of course there's another organization called ISO maintaining another standard, "ISO 10646", which is nearly the same.

2. UTF-8

As above, U+0024 is just a position, so we can't save "U+0024" in the computer for the character "$".

There must be an encoding method.

Then come encoding methods, such as UTF-8, UTF-16, UTF-32, UCS-2....

Under UTF-8, the code point U+0024 is encoded into 00100100.

00100100 is the value we save in the computer for "$".
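
This is easy to confirm with Python's built-ins (ord gives the code point, str.encode gives the stored bytes):

print(hex(ord("$")))                 # 0x24
print(format(ord("$"), "08b"))       # 00100100
print("$".encode("utf-8"))           # b'$' -> a single byte, 0x24, same as in ASCII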

Answered 2015-01-05T09:28:52.563
13

If I may summarise what I gathered from this thread:

Unicode assigns characters to ordinal numbers (in decimal form). (These numbers are called code points.)

à -> 224

UTF-8 is an encoding that 'translates' these ordinal numbers (in decimal form) to binary representations.

224 -> 11000011 10100000

Note that 11000011 10100000 is the UTF-8 encoding of code point 224, not the plain binary form of the number 224 itself, which would be 0b11100000.
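
In Python terms (bin, chr and str.encode are built-ins), the distinction looks like this:

print(bin(224))                         # 0b11100000  -> the number 224 itself
print(chr(224))                         # à           -> the character at code point 224
print(chr(224).encode("utf-8").hex())   # c3a0        -> 11000011 10100000 in UTF-8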

Answered 2019-07-18T07:17:46.627
12

I've checked the links in Gumbo's answer, and I wanted to paste part of that material here on Stack Overflow.

"...Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point, which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole other story..."

"...Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041...."

"...OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message..."

"...That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ? ..."
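
Both byte orders really do exist; that is what UTF-16BE / UTF-16LE and the byte order mark (BOM) are about. A small Python illustration (bytes.hex with a separator needs Python 3.8+):

s = "Hello"

print(s.encode("utf-16-be").hex(" "))  # 00 48 00 65 00 6c 00 6c 00 6f
print(s.encode("utf-16-le").hex(" "))  # 48 00 65 00 6c 00 6c 00 6f 00
print(s.encode("utf-16").hex(" "))     # ff fe ... -> a byte order mark first, then the platform's byte order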

Answered 2011-05-30T09:37:52.173
4

They are the same thing, aren't they?

No, they aren't.


I think the first sentence of the Wikipedia page you referenced gives a nice, brief summary:

UTF-8 is a variable width character encoding capable of encoding all 1,112,064 valid code points in Unicode using one to four 8-bit bytes.

To elaborate:

  • Unicode is a standard, which defines a map from characters to numbers, the so-called code points, (like in the example below). For the full mapping, you can have a look here.

    ! -> U+0021 (21),  
    " -> U+0022 (22),  
    # -> U+0023 (23)
    
  • UTF-8 is one of the ways to encode these code points in a form a computer can understand, aka bits. In other words, it's a way/algorithm to convert each of those code points to a sequence of bits or convert a sequence of bits to the equivalent code points. Note that there are a lot of alternative encodings for Unicode.


Joel gives a really nice explanation and an overview of the history here.

Answered 2018-01-11T19:12:34.130
3

UTF-8 is a method for encoding Unicode characters using 8-bit sequences.

Unicode is a standard for representing a great variety of characters from many languages.

Answered 2018-01-26T13:35:55.527
1

My explanation, after reading numerous posts and articles about this topic:

1 - The Unicode Character Table

"Unicode" is a giant table, that is 21bits wide, these 21bits provide room for 1,114,112 codepoints / values / fields / places to store characters in.

Out of those 1,114,112 codepoints, 1,111,998 are able to store Unicode characters, because there are 2048 codepoints reserved as surrogates and 66 codepoints reserved as non-characters. So, there are 1,111,998 codepoints that can store a unique character, symbol, emoji, etc.

However, as of now, only 144,697 out of those 1,114,112 codepoints have been used (that is the character count of Unicode 14.0). These 144,697 codepoints cover all of the languages, as well as symbols, emojis, etc.

Each character in "Unicode" is assigned a specific codepoint, aka it has a specific value / Unicode number. For example, the character "❤" has the Unicode number "U+2764". The value "U+2764" takes up exactly one codepoint out of the 1,114,112. As a plain number, "U+2764" looks like this in binary: "00100111 01100100", which is exactly 2 bytes or 16 bits (ignore the space character, which I added only to make the 16 bits more readable).

Now, how is our computer supposed to know whether those 2 bytes "00100111 01100100" are to be read separately or together? If those 2 bytes are read separately and converted to characters, the result would be "'" and "d" (their ASCII interpretations), which is quite different from our heart emoji "❤".

2 - Encoding Standards (UTF-8, ISO-8859, Windows-1251, etc.)

In order to solve this problem, people invented encoding standards. The most popular one has been UTF-8 since 2008; it accounts for roughly 97.6% of all web pages, which is why we will use UTF-8 for the example below.

2.1 - What is Encoding?

Encoding, simply said, means converting something from one form to another. In our case, we are converting data, more specifically bytes, to the UTF-8 format. I would also like to rephrase that as "converting bytes to UTF-8 bytes", although it might not be technically correct.

2.2 Some information about the UTF-8 format, and why it's so important

UTF-8 uses a minimum of 1 byte to store a character and a maximum of 4 bytes. Thanks to the UTF-8 format, we can have characters that take more than 1 byte of information.

This is very important, because if it were not for a multi-byte format like UTF-8, we would not be able to have such a vast diversity of alphabets, since the letters of some alphabets can't fit into 1 byte. We also wouldn't have emojis at all, since each one requires at least 3 bytes. I am pretty sure you get the point by now, so let's continue forward.

2.3 Example of Encoding a Chinese character to UTF-8

Now, let's say we have the Chinese character "汉".

This character's code point takes exactly 16 binary bits, "01101100 01001001". Thus, as we discussed above, we cannot reliably read this character back unless we encode it to UTF-8, because the computer would have no way of knowing whether these 2 bytes are to be read separately or together.

Converting this "汉" character's 2 bytes into what I like to call UTF-8 bytes results in the following:

(Normal Bytes) "01101100 01001001" -> (UTF-8 Encoded Bytes) "11100110 10110001 10001001"

Now, how did we end up with 3 bytes instead of 2? How is that supposed to be UTF-8 Encoding, turning 2 bytes into 3?

In order to explain how the UTF-8 encoding works, I am going to literally copy the reply of @MatthiasBraun, a big shoutout to him for his terrific explanation.

2.4 How does the UTF-8 encoding actually work?

What we have here is the template for Encoding bytes to UTF-8. This is how Encoding happens, pretty exciting if you ask me!

Now, take a good look at the table below and then we are going to go through it together.

        Binary format of bytes in sequence:

        1st Byte    2nd Byte    3rd Byte    4th Byte    Number of Free Bits   Maximum Expressible Unicode Value
        0xxxxxxx                                                7             007F hex (127)
        110xxxxx    10xxxxxx                                (5+6)=11          07FF hex (2047)
        1110xxxx    10xxxxxx    10xxxxxx                  (4+6+6)=16          FFFF hex (65535)
        11110xxx    10xxxxxx    10xxxxxx    10xxxxxx    (3+6+6+6)=21          10FFFF hex (1,114,111)
  1. The "x" characters in the table above represent the number of "Free Bits", those bits are empty and we can write to them.

  2. The other bits are reserved for the UTF-8 format, they are used as headers / markers. Thanks to these headers, when the bytes are being read using the UTF-8 encoding, the computer knows, which bytes to read together and which seperately.

  3. The byte size of your character, after being encoded using the UTF-8 format, depends on how many bits you need to write.

  • In our case, the "汉" character is exactly 2 bytes or 16 bits:

  • "01101100 01001001"

  • thus the size of our character, after being encoded to UTF-8, will be 3 bytes or 24 bits

  • "11100110 10110001 10001001"

  • because "3 UTF-8 bytes" have 16 free bits, which we can write to

  4. The solution, step by step, is below:

2.5 Solution:

        Header  Place holder    Fill in our Binary   Result         
        1110    xxxx            0110                 11100110
        10      xxxxxx          110001               10110001
        10      xxxxxx          001001               10001001 

2.6 Summary:

        A Chinese character:      汉
        its Unicode value:        U+6C49
        convert 6C49 to binary:   01101100 01001001
        encode 6C49 as UTF-8:     11100110 10110001 10001001

3 - The difference between UTF-8, UTF-16 and UTF-32

Original explanation of the difference between the UTF-8, UTF-16 and UTF-32 encodings: https://javarevisited.blogspot.com/2015/02/difference-between-utf-8-utf-16-and-utf.html

The main difference between UTF-8, UTF-16, and UTF-32 character encodings is how many bytes they require to represent a character in memory:

UTF-8 uses a minimum of 1 byte, but if the character is bigger, then it can use 2, 3 or 4 bytes. UTF-8 is also compatible with the ASCII table.

UTF-16 uses a minimum of 2 bytes. UTF-16 cannot take 3 bytes; it can take either 2 or 4 bytes. UTF-16 is not compatible with the ASCII table.

UTF-32 always uses 4 bytes.

Remember: UTF-8 and UTF-16 are variable-length encodings, where UTF-8 can take 1 to 4 bytes, while UTF-16 can take either 2 or 4 bytes. UTF-32 is a fixed-width encoding; it always takes 32 bits.

Answered 2022-01-15T01:23:39.753
0

So you end up here usually from Google, and want to try different stuff.
But how do you print and convert all these character sets?

Here I list a few useful one-liners.

In Powershell:

# Print character with the Unicode point (U+<hexcode>) using this: 
[char]0x2550

# With Python installed, you can print the unicode character from U+xxxx with:
python -c 'print(u"\u2585")'

If you have more PowerShell tricks or shortcuts, please comment.

In Bash, you'd appreciate the iconv, hexdump and xxd from the libiconv and util-linux packages (probably named differently on other *nix distros.)

# To print the 3-byte hex code for a Unicode character:
printf "\\\x%s" $(printf '═'|xxd -p -c1 -u)
#\xE2\x95\x90

# To print the Unicode character represented by hex string:
printf '\xE2\x96\x85'
#▅

# To convert from UTF-16LE to Unicode
echo -en "════"| iconv -f UTF-16LE -t UNICODEFFFE

# To convert a string into hex: 
echo -en '═�'| xxd -g 1
#00000000: e2 95 90 ef bf bd

# To convert a string into binary:
echo -en '═�\n'| xxd -b
#00000000: 11100010 10010101 10010000 11101111 10111111 10111101  ......
#00000006: 00001010

# To convert a binary string into hex:
printf  '%x\n' "$((2#111000111000000110000010))"
#e38182

Answered 2022-01-04T14:50:54.587
-1

As a simple answer that gets straight to the point:

  • Unicode is a standard for representing characters from many human languages.
  • UTF-8 is a method for encoding Unicode characters.

* Yes: I'm overlooking the inner workings of UTF-8 on purpose.

Answered 2021-11-10T21:52:53.957