c# - Encoding that returns one byte for European characters

Question

If I encode the following string as UTF8:

café

it returns 5 bytes instead of 4. If possible, I would like it to return 4 bytes.

Encoding encoding = Encoding.UTF8;
string testString = "café";
Byte[] bytes = encoding.GetBytes(testString);

Returns:

[0] 99
[1] 97
[2] 102
[3] 195
[4] 169

whereas "cafe" returns only 4 bytes.

score 3 · Accepted Answer

You can't get this from the usual Unicode encodings.

You need to get an encoding for the code page you want, like this:

Encoding encoding = Encoding.GetEncoding(437);
byte[] bytes = encoding.GetBytes("café");

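Output:

{ 99, 97, 102, 130 }

é is 130 in code page 437.

If you later want to decode these bytes, you need to decode them with the same encoding. Otherwise you will get strange results.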
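To illustrate the point about decoding, here is a minimal sketch that encodes with code page 437 and then decodes the same bytes twice, once with the matching encoding and once with ISO-8859-1. The class and method names are made up for this example. The `RegisterProvider` call is an assumption about your runtime: on .NET Core and .NET 5+ legacy code pages such as 437 are not available until the code pages provider is registered, while on .NET Framework the call should not be needed.

```csharp
using System;
using System.Text;

public static class CodePageRoundTrip
{
    // Hypothetical helper: returns the encoding for code page 437.
    public static Encoding GetCp437()
    {
        // Assumption: needed on .NET Core / .NET 5+, where legacy code
        // pages are only available after registering this provider.
        Encoding.RegisterProvider(CodePagesEncodingProvider.Instance);
        return Encoding.GetEncoding(437);
    }

    public static void Main()
    {
        Encoding cp437 = GetCp437();
        Encoding latin1 = Encoding.GetEncoding("ISO-8859-1");

        // "caf\u00E9" is "café" with a precomposed é (U+00E9).
        byte[] bytes = cp437.GetBytes("caf\u00E9"); // 4 bytes, the last is 130

        string sameEncoding = cp437.GetString(bytes);   // round-trips to "café"
        string otherEncoding = latin1.GetString(bytes); // 130 becomes U+0082, a control character

        Console.WriteLine(sameEncoding == "caf\u00E9");
        Console.WriteLine(otherEncoding == "caf\u00E9");
    }
}
```

The bytes themselves carry no information about which code page produced them, so the encoding has to be stored or agreed on separately.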

score 3 · Accepted Answer

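é is Unicode U+00E9. Unicode characters U+0080 through U+07FF take two bytes in UTF-8. See http://en.wikipedia.org/wiki/Utf8 for details.

If you need exactly 4 bytes, you cannot use UTF-8. In theory, you could use ISO 8859-1, which is a single-byte character encoding.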
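The two bytes 195 and 169 from the question can be reproduced by hand. The sketch below applies the two-byte UTF-8 layout (110xxxxx 10xxxxxx) to U+00E9 and compares the result with `Encoding.UTF8`. The class and method names are hypothetical and only serve the illustration.

```csharp
using System;
using System.Text;

public static class Utf8TwoByte
{
    // Encodes a code point in the range U+0080..U+07FF as two UTF-8 bytes.
    public static byte[] EncodeTwoByte(int codePoint)
    {
        if (codePoint < 0x80 || codePoint > 0x7FF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        byte lead = (byte)(0xC0 | (codePoint >> 6));    // 110xxxxx: upper 5 bits
        byte trail = (byte)(0x80 | (codePoint & 0x3F)); // 10xxxxxx: lower 6 bits
        return new[] { lead, trail };
    }

    public static void Main()
    {
        byte[] manual = EncodeTwoByte(0xE9);               // é
        byte[] builtIn = Encoding.UTF8.GetBytes("\u00E9"); // same character via the framework

        Console.WriteLine(string.Join(", ", manual));  // 195, 169
        Console.WriteLine(string.Join(", ", builtIn)); // 195, 169
    }
}
```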

score 2 · Accepted Answer

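A character in UTF-8 takes 1 to 4 bytes (the original specification allowed sequences of up to 6 bytes, but the current one, RFC 3629, limits them to 4). So in your case "é" needs 2 bytes. You can read more about UTF-8 here: UTF-8, a transformation format of ISO 10646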
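As a quick check of the byte counts, the following sketch asks `Encoding.UTF8.GetByteCount` for the size of one character from each length class. The sample characters and the helper name are chosen for this example only.

```csharp
using System;
using System.Text;

public static class Utf8Lengths
{
    // Number of bytes UTF-8 needs for the given text.
    public static int Utf8Length(string text)
    {
        return Encoding.UTF8.GetByteCount(text);
    }

    public static void Main()
    {
        Console.WriteLine(Utf8Length("c"));          // 1: ASCII
        Console.WriteLine(Utf8Length("\u00E9"));     // 2: é, U+00E9
        Console.WriteLine(Utf8Length("\u20AC"));     // 3: euro sign, U+20AC
        Console.WriteLine(Utf8Length("\U0001F600")); // 4: U+1F600, outside the BMP
        Console.WriteLine(Utf8Length("caf\u00E9"));  // 5: 3 ASCII bytes + 2 for é
    }
}
```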

score 0 · Accepted Answer

In the end I converted from UTF8 to ISO8859-1, which now returns 4 bytes instead of 5.

Encoding utf8 = Encoding.UTF8;
string testString = "café";
byte[] utfBytes = utf8.GetBytes(testString); // 5 bytes

Encoding iso = Encoding.GetEncoding("ISO-8859-1");
byte[] isoBytes = iso.GetBytes(testString); // 4 bytes
byte[] convertedUtf8Bytes = Encoding.Convert(utf8, iso, utfBytes); // 4 bytes

string msg = iso.GetString(isoBytes);
string msgConverted = iso.GetString(convertedUtf8Bytes);

Console.WriteLine(msg);
Console.WriteLine(msgConverted);

Output:

café
café
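One caveat with a single-byte encoding is that it only covers 256 characters, so text outside ISO-8859-1 cannot be encoded without loss. A minimal sketch of one way to detect that, assuming you would rather fail than lose characters silently: request the encoding with `EncoderFallback.ExceptionFallback` and catch the resulting exception. `TryEncodeLatin1` is a hypothetical helper name.

```csharp
using System;
using System.Text;

public static class StrictLatin1
{
    // Returns true and the encoded bytes if every character fits in ISO-8859-1.
    public static bool TryEncodeLatin1(string text, out byte[] bytes)
    {
        // Exception fallbacks make unencodable characters throw
        // instead of being replaced.
        Encoding strictLatin1 = Encoding.GetEncoding(
            "ISO-8859-1",
            EncoderFallback.ExceptionFallback,
            DecoderFallback.ExceptionFallback);
        try
        {
            bytes = strictLatin1.GetBytes(text);
            return true;
        }
        catch (EncoderFallbackException)
        {
            bytes = Array.Empty<byte>();
            return false;
        }
    }

    public static void Main()
    {
        byte[] bytes;
        Console.WriteLine(TryEncodeLatin1("caf\u00E9", out bytes));        // True, 4 bytes
        Console.WriteLine(TryEncodeLatin1("caf\u00E9 \u20AC", out bytes)); // False: the euro sign is not in ISO-8859-1
    }
}
```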
