utf-8 - Invalid UTF-8 bytes

Question

Not all byte sequences are valid UTF-8. A UTF-8 decoder should be prepared for:

1. the red invalid bytes in the above table
2. an unexpected continuation byte
3. a start byte not followed by enough continuation bytes
4. an overlong encoding as described above
5. a 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF

According to the codepage layout, 0xC0 and 0xC1 are invalid and must never appear in a valid UTF-8 sequence. Here is what I have for code points 0xC0 and 0xC1:

Byte 2   Byte 1      Num   Char
11000011 10000000    192   À
11000011 10000001    193   Á

These byte sequences do have corresponding characters, but they shouldn't. Am I doing something wrong?

score 9 · Accepted Answer

You are just mixing up the terminology:

The code point U+00C0 is the character "À", and U+00C1 is "Á".
Encoded in UTF-8, they are the byte sequences C3 80 and C3 81, respectively.
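
To make the mapping concrete, here is a minimal Python sketch of how a code point in the two-byte range (U+0080 to U+07FF) is packed into the pattern 110xxxxx 10xxxxxx. The helper name `encode_two_byte` is made up for this illustration; the built-in `str.encode` is used only to cross-check the result.

```python
def encode_two_byte(cp: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as two UTF-8 bytes (illustrative helper)."""
    if not 0x80 <= cp <= 0x7FF:
        raise ValueError("code point is outside the two-byte UTF-8 range")
    lead = 0xC0 | (cp >> 6)    # 110xxxxx: upper 5 bits of the code point
    cont = 0x80 | (cp & 0x3F)  # 10xxxxxx: lower 6 bits of the code point
    return bytes([lead, cont])


for cp in (0xC0, 0xC1):
    manual = encode_two_byte(cp)
    builtin = chr(cp).encode("utf-8")
    hex_bytes = " ".join(f"{b:02X}" for b in manual)
    # The manual packing and Python's own encoder should agree.
    print(f"U+{cp:04X} {chr(cp)} -> {hex_bytes} (matches built-in: {manual == builtin})")
```

The number 0xC0 shows up here only as the code point being encoded and as a bit mask; the lead byte actually emitted for U+00C0 is C3.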

The bytes C0 and C1 should never appear in UTF-8 encoded data.
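
The reason is that a lead byte of C0 or C1 could only start a two-byte sequence for a value below 0x80, which would be an overlong encoding of an ASCII character. As a rough check, assuming Python's built-in strict UTF-8 decoder, the sketch below tries a few of the invalid cases from the question. The helper name `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data: bytes) -> bool:
    """Return True if data decodes as strict UTF-8 (illustrative helper)."""
    try:
        data.decode("utf-8")  # error handling is "strict" by default
        return True
    except UnicodeDecodeError:
        return False


samples = {
    "C3 80 (valid encoding of U+00C0)": b"\xc3\x80",
    "C0 80 (overlong, lead byte C0)": b"\xc0\x80",
    "C1 BF (overlong, lead byte C1)": b"\xc1\xbf",
    "80 (unexpected continuation byte)": b"\x80",
    "C3 (lead byte without continuation)": b"\xc3",
    "F4 90 80 80 (above U+10FFFF)": b"\xf4\x90\x80\x80",
}

for label, raw in samples.items():
    print(f"{label}: {is_valid_utf8(raw)}")
```

Only the first sample should be accepted; the others correspond to the error cases a decoder has to reject.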

Code points identify characters independently of any byte representation. Bytes are just bytes.
