c# - What happens to a null byte when converting bytes to ISO 8859-1 encoding?

Question

I'm not entirely sure if the question even makes sense. I'm taking a byte array from an ID3 tag and converting it to a string. Most text frames in an ID3 tag use ISO 8859-1 encoding, but it depends on the frame. In any case, if you look up what 0x00 is in the ISO 8859-1 code table, it is listed as invalid.

To complicate things further, either due to programmer error or just poor formatting, some of the strings end in 0x00 and some do not.

When converting a series of bytes into a string using ISO 8859-1 encoding, do you have to manually check the end of the string to see if it is a null? Or will the encoding object, through whatever method it uses to convert in the first place, deal with the null properly? Furthermore, is there some sort of function that could normalize or "fix" the null-terminated string?

When you try to display these strings they do not display properly.

I am using C# for this particular project. Some extra info here about ID3 Tags: ID3 Specs

Or am I completely misunderstanding the whole thing? Is a null terminator simply a way a particular language handles strings and it has nothing to do with encoding?

Edit: I used System.Text.Encoding.GetEncoding("iso-8859-1") followed by a GetString call

score 5 · Accepted Answer

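If you use Encoding.GetEncoding(28591), it simply converts byte 0 to Unicode U+0000. Encodings generally assume they have to convert all of the bytes; they don't look for terminators.

This handling of 0 as Unicode 0 is consistent with the Wikipedia description:

In 1992, the IANA registered the character map ISO_8859-1:1987, more commonly known by its preferred MIME name of ISO-8859-1 (note the extra hyphen over ISO 8859-1), a superset of ISO 8859-1, for use on the Internet. This map assigns the C0 and C1 control characters to the unassigned code values, thus providing 256 characters via every possible 8-bit value.

And the C0 and C1 control characters page includes:

0: Originally used to allow gaps to be left on paper tape for edits. Later used for padding after a code that might take a terminal some time to process (e.g. a carriage return or line feed on a printing terminal). Now often used as a string terminator, especially in the C programming language.

Sample code: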

using System;
using System.Text;

class Program
{
    static void Main(string[] args)
    {
        byte[] data = { 0, 0 };
        Encoding latin1 = Encoding.GetEncoding(28591);

        string text = latin1.GetString(data);
        Console.WriteLine(text.Length); // 2
        Console.WriteLine((int) text[0]); // 0
        Console.WriteLine((int) text[1]); // 0
    }
}
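
Since the encoding converts every byte and never looks for a terminator, one option is to find the terminator yourself and decode only the bytes before it. The following is a minimal sketch under that assumption; the class name Id3TextSketch and the method DecodeUntilNull are hypothetical names chosen for illustration, not part of .NET or of any ID3 library.

```csharp
using System;
using System.Text;

static class Id3TextSketch
{
    // Hypothetical helper: decodes ISO 8859-1 bytes up to, but not
    // including, the first 0x00 byte. If there is no 0x00, the whole
    // array is decoded.
    public static string DecodeUntilNull(byte[] data)
    {
        Encoding latin1 = Encoding.GetEncoding(28591);

        // Array.IndexOf returns -1 when the value is not present.
        int end = Array.IndexOf(data, (byte)0);
        if (end < 0)
        {
            end = data.Length;
        }

        return latin1.GetString(data, 0, end);
    }
}
```

This treats the first null as the end of the text, which may or may not be right for a given frame; a frame that uses 0x00 as a separator between several values would need to be split instead.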

score 0 · Answer

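Happily, ASCII, ISO-8859-1, and Unicode all agree on the code points in the range 0..127. So your character '\0' will be encoded identically in ASCII, ISO-8859-1, and UTF-8.

If your program assigns special semantics to the zero byte, you have to handle it appropriately.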
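
A quick way to check this claim is to encode the same text with all three encodings and compare the resulting bytes. This is only an illustrative sketch; NullByteCheck and EncodesIdentically are made-up names.

```csharp
using System;
using System.Linq;
using System.Text;

static class NullByteCheck
{
    // Hypothetical helper: returns true when ASCII, ISO 8859-1 and UTF-8
    // all produce the same byte sequence for the given text.
    public static bool EncodesIdentically(string text)
    {
        byte[] ascii = Encoding.ASCII.GetBytes(text);
        byte[] latin1 = Encoding.GetEncoding(28591).GetBytes(text);
        byte[] utf8 = Encoding.UTF8.GetBytes(text);

        return ascii.SequenceEqual(latin1) && latin1.SequenceEqual(utf8);
    }
}
```

Calling NullByteCheck.EncodesIdentically("\0") should return true, while a character outside 0..127 such as "é" should return false because the three encodings diverge there.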
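
For the case in the question, where some strings end in 0x00 and some do not, one simple way to handle it is to decode everything and then strip trailing U+0000 characters. A minimal sketch, assuming trailing nulls carry no meaning in your frames; Id3TrimSketch and DecodeAndTrim are hypothetical names.

```csharp
using System.Text;

static class Id3TrimSketch
{
    // Hypothetical helper: decodes ISO 8859-1 bytes, then removes any
    // U+0000 characters at the end of the resulting string.
    // Nulls in the middle of the string are left untouched.
    public static string DecodeAndTrim(byte[] data)
    {
        string text = Encoding.GetEncoding(28591).GetString(data);
        return text.TrimEnd('\0');
    }
}
```

Both terminated and unterminated inputs then yield the same string, so the rest of the program does not need to care which form the tag used.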
