encoding - File encoding and garbled strings

Question

I am working with a text file that contains many garbled strings, for example:

VyplÅ<88>te prosÃm pole "jmÃ©no

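My editor says the file encoding is latin1. The string is supposed to be a Czech sentence with some diacritics, so it is no wonder it displays incorrectly. I tried forcing the utf8 and latin2 encodings in my editor, but that did not help. I also tried using iconv to convert the file from latin1 to utf8 or latin2, but neither helped. I run into this kind of problem often, and I don't know any solution other than rewriting the strings by hand. Is there a better way to solve this?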

Edit:

Here is the original text:

Vyplňte prosím pole "jméno"

Here is a hex dump of the part of the file where the garbled string appears:

0002640: 6a6d 656e 6f22 5d20 3d20 2744 453a 2056  jmeno"] = 'DE: V
0002650: 7970 6cc5 8874 6520 7072 6f73 c3ad 6d20  ypl..te pros..m 
0002660: 706f 6c65 2022 6a6d c3a9 6e6f 222e 273b  pole "jm..no".';

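Edit 2:

As deceze said, the sentence above is indeed valid utf8. But I just found something strange. If I try to transcode the file from utf8 to utf8 (using iconv), I get an error at the word Postgebühr, at the character ü. Looking at the hex dump, this character is represented as \xfc (252 in decimal), which is the valid latin1 byte encoding of ü but a completely invalid utf8 byte encoding. It seems that part of the file is in latin1 and another part is in utf8. Here is the part of the file that is (probably) in latin1: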

0000250: 506f 7374 6765 62fc 6872 273b 0a09 0963  Postgeb.hr';...c
0000260: 6f6e 665b 2277 6166 6572 7322 5d20 3d20  onf["wafers"] = 
0000270: 2744 453a 206f 706c c3a1 746b 20c3 273b  'DE: opl..tk .';

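Looking into this further, the last line does not even seem to be valid latin1, because it is garbage even in latin1 (DE: oplÃ¡tk Ã instead of, probably, DE: oplatky za). This part of the file seems to contain some corrupted text.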

I can't understand how encoding in this file could have got mixed up like that. Any ideas?
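
To see which parts of such a file are not valid UTF-8, one option is to try decoding it line by line and record where decoding fails. The following is only a minimal sketch: the function name `find_non_utf8_lines` and the sample bytes are made up for illustration (the sample imitates the two problem lines from the dumps above, not the real file).

```python
def find_non_utf8_lines(data: bytes):
    """Return (line number, byte offset in line, offending byte) for each
    line that cannot be decoded as UTF-8. Only the first bad byte of a
    line is reported."""
    problems = []
    for lineno, line in enumerate(data.split(b"\n"), start=1):
        try:
            line.decode("utf-8")
        except UnicodeDecodeError as err:
            # err.start is the offset of the first undecodable byte
            problems.append((lineno, err.start, line[err.start]))
    return problems


# Hypothetical sample: one valid UTF-8 line, one line with a lone 0xFC
# (Latin1 "ü"), and one line with a truncated UTF-8 sequence (0xC3 followed
# by an ASCII quote), similar to the dumps in the question.
sample = (
    b"a = 'Vypl\xc5\x88te';\n"
    b"b = 'Postgeb\xfchr';\n"
    b"c = 'opl\xc3\xa1tk \xc3';\n"
)

for lineno, offset, byte in find_non_utf8_lines(sample):
    print(f"line {lineno}, offset {offset}: byte 0x{byte:02x}")
```

A lone 0xFC suggests a single-byte encoding such as Latin1, while a 0xC3 with no continuation byte looks more like UTF-8 text that was cut off, which would match the impression that this part of the file is corrupted rather than merely encoded differently.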

score 2 · Accepted Answer

If the file is supposed to contain Latin2 encoded text, then trying to convert it from Latin1 or similar is of course messing things up.

The problem is simply that your text editor does not automagically recognize the encoding, since the single-byte Latin* encodings all look identically interchangeable on a byte level. If your editor "tells" you the encoding is Latin1, what it means is that it is currently interpreting the file as Latin1. Obviously it has that wrong.
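
As a small illustration of why an editor cannot tell the Latin* encodings apart, the same bytes decode without error under both ISO 8859-1 and ISO 8859-2; only the resulting characters differ. This is a sketch using Python's built-in codecs, with part of the sentence from the question as sample text.

```python
original = "Vyplňte prosím pole"

# Encode as Latin2 (ISO 8859-2): one byte per character.
raw = original.encode("iso-8859-2")

# Both decodes succeed; nothing at the byte level says which one is right.
as_latin2 = raw.decode("iso-8859-2")
as_latin1 = raw.decode("iso-8859-1")

print(as_latin2)  # Vyplňte prosím pole
print(as_latin1)  # Vyplòte prosím pole
```

The byte 0xF2 is "ň" in Latin2 and "ò" in Latin1, while "í" (0xED) happens to sit at the same position in both, so a wrong guess may corrupt only some of the accented letters.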

You either need to tell your editor to treat the file as Latin2 (Open As... Latin2, or however your editor gives you this choice) or to convert the file from Latin2 into an encoding your editor handles correctly.

To understand encodings better, I recommend you read What Every Programmer Absolutely, Positively Needs To Know About Encodings And Character Sets To Work With Text.

In response to your posted hex dump: That file is UTF-8 encoded.
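
This can be checked directly against the bytes from the posted dump. The sketch below copies the bytes of the sentence from the hex dump in the question and decodes them both ways; the variable names are just for illustration.

```python
# Bytes of the sentence, copied from the hex dump in the question
# (starting at the "V" and ending at the closing double quote).
dump = bytes.fromhex(
    "56"
    "7970 6cc5 8874 6520 7072 6f73 c3ad 6d20"
    "706f 6c65 2022 6a6d c3a9 6e6f 22"
)

as_utf8 = dump.decode("utf-8")         # the intended text
as_latin1 = dump.decode("iso-8859-1")  # what an editor shows if it assumes Latin1

print(as_utf8)          # Vyplňte prosím pole "jméno"
print(repr(as_latin1))  # repr() makes the invisible characters visible
```

Decoded as Latin1, each two-byte UTF-8 sequence turns into two characters (for example c5 88 becomes "Å" plus a control character), which is the garbled form shown at the top of the question.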

score 0 · Answer

iconv is the way to go, but you must know the correct encoding. Latin2 (ISO 8859-2) is only one of the possibilities, since there were many ASCII extensions in Europe. What language is this supposed to be in?
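
For a file that mixes encodings, as the question's second edit suggests, a plain whole-file conversion will either fail or mangle the parts that are already UTF-8. One possible workaround, sketched here in Python rather than with iconv, is to convert line by line and apply the assumed legacy encoding only to lines that are not valid UTF-8. The function name `to_utf8_per_line` and the choice of Latin1 as the fallback are assumptions for illustration; the fallback still has to be the right encoding for those lines.

```python
def to_utf8_per_line(data: bytes, fallback: str) -> bytes:
    """Re-encode data as UTF-8, line by line.

    Lines that already decode as UTF-8 are kept as they are; any other
    line is decoded with the `fallback` encoding, which is a guess the
    caller must supply.
    """
    out = []
    for line in data.split(b"\n"):
        try:
            text = line.decode("utf-8")
        except UnicodeDecodeError:
            text = line.decode(fallback)
        out.append(text.encode("utf-8"))
    return b"\n".join(out)


# Hypothetical sample: first line is UTF-8, second line is Latin1.
mixed = b"a = 'Vypl\xc5\x88te';\nb = 'Postgeb\xfchr';\n"
fixed = to_utf8_per_line(mixed, fallback="iso-8859-1")
print(fixed.decode("utf-8"))
```

This cannot repair text that is genuinely corrupted, such as a truncated multi-byte sequence; those lines would simply be decoded with the fallback and come out garbled, so they still need to be fixed by hand.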
