utf-8 - Ã© and other codes

Question

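I have a file containing these codes, and I want to "translate" it into normal characters (I mean the whole file). How can I do that?

Thank you very much in advance.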

score 19 · Accepted Answer

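It looks like you originally had a UTF-8 file that was interpreted as an 8-bit encoding (e.g. ISO-8859-15) and then entity-encoded. I say this because the sequence C3A9 looks like a pretty plausible UTF-8 encoding sequence.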
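To make that diagnosis concrete, here is a minimal sketch of how such damage could arise. It assumes the 8-bit encoding was ISO-8859-1 (the variable names are only for illustration); the bytes C3 and A9 map to the same characters in ISO-8859-15.

```php
<?php
// "é" encoded as UTF-8 is the two bytes C3 A9.
$original = "\xC3\xA9";

// Treat each byte as a separate ISO-8859-1 character and entity-encode it.
// This is an assumed reconstruction of what happened to the file.
$damaged = htmlentities($original, ENT_QUOTES, 'ISO-8859-1');

echo $damaged, "\n"; // prints &Atilde;&copy;
```

Rendered in a browser, those two entities show up as Ã©, which is the pair of characters in the question title.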

You need to entity-decode it first, after which you will have UTF-8 again. You can then use something like iconv to convert to the encoding of your choice.

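To work through your example:

0xC3A9 = 11000011 10101001 in binary
The leading 110 in the first octet tells us this could be interpreted as a UTF-8 two-byte sequence. As the second octet starts with 10, we are looking at something we can interpret as UTF-8. To do that, we take the last 5 bits of the first octet and the last 6 bits of the second octet...
So, interpreted as UTF-8, it is 00011101001 = E9 = é (LATIN SMALL LETTER E WITH ACUTE)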
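The same bit arithmetic can be sketched in PHP. The function name decodeTwoByteUtf8 is hypothetical, and the sketch only handles two-byte sequences; it does not reject overlong forms.

```php
<?php
// Decode a single two-byte UTF-8 sequence by hand.
function decodeTwoByteUtf8(int $first, int $second): int
{
    // The first octet must look like 110xxxxx, the second like 10xxxxxx.
    if (($first & 0xE0) !== 0xC0 || ($second & 0xC0) !== 0x80) {
        throw new InvalidArgumentException('Not a two-byte UTF-8 sequence');
    }
    // Low 5 bits of the first octet, followed by low 6 bits of the second.
    return (($first & 0x1F) << 6) | ($second & 0x3F);
}

$codePoint = decodeTwoByteUtf8(0xC3, 0xA9);
printf("U+%04X\n", $codePoint); // prints U+00E9
```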

You mentioned wanting to handle this with PHP; something like this might do it for you:

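 // To load from a file, use:
 // $file = file_get_contents("/path/to/filename.txt");
 // The example below uses a literal string to demonstrate the technique.

 $file = "Pr&#xC3;&#xA9;c&#xC3;&#xA9;dent is a French word";
 // Decode to single bytes so that C3 A9 becomes UTF-8 "é" again.
 // In PHP 5.4 and later the default charset is no longer ISO-8859-1, so pass it explicitly.
 $utf8 = html_entity_decode($file, ENT_QUOTES, 'ISO-8859-1');
 // utf8_decode() is deprecated as of PHP 8.2; iconv('UTF-8', 'ISO-8859-1', $utf8) does the same job.
 $iso8859 = utf8_decode($utf8);

 // $utf8 contains "Précédent is a French word" in UTF-8
 // $iso8859 contains "Précédent is a French word" in ISO-8859-1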
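Since the question asks about a whole file, here is a rough sketch of the same idea wrapped in a function. The name repairEntityEncodedFile and its parameters are hypothetical, and it assumes the file was misread as ISO-8859-1. If the wrong encoding was actually Windows-1252, some entities (for example &euro;) will not decode back to the original bytes.

```php
<?php
// Sketch: undo "UTF-8 read as ISO-8859-1, then entity-encoded" for one file.
function repairEntityEncodedFile(string $inPath, string $outPath, string $target = 'UTF-8'): void
{
    $raw = file_get_contents($inPath);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $inPath");
    }
    // Decode each entity to a single byte; the byte stream is then UTF-8 again.
    $utf8 = html_entity_decode($raw, ENT_QUOTES, 'ISO-8859-1');
    // Optionally convert to another encoding of your choice with iconv.
    $out = ($target === 'UTF-8') ? $utf8 : iconv('UTF-8', $target, $utf8);
    if ($out === false || file_put_contents($outPath, $out) === false) {
        throw new RuntimeException("Cannot convert or write $outPath");
    }
}
```

Calling repairEntityEncodedFile('in.txt', 'out.txt') would keep the result as UTF-8, while passing 'ISO-8859-1' as the third argument would convert it on the way out. Try it on a copy of the file first.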
