unicode - How do I determine which code page I'm looking at?

Question

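I have a device that comes with some documentation on how to send text to it. It uses 0x00-0x7F to send "special" characters such as accented characters, the euro sign, ...

I'm guessing they copied an existing code page and made some changes, but I don't know how to find out which code page is closest to the one in my documentation.

In theory this should be easy to do. For example, they map Á to 0x41, so if I could find some way to go through all code pages and pick out the ones that have this character at that position, it would be a piece of cake.

However, all I can find on the Internet are links to code page dumps, just like the one I'm looking at, or software that uses heuristics to read text and guess the most likely code page. Surely there is some way to look up a code page one is looking at?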

score 4 · Accepted Answer

If it uses 0x00 to 0x7F for the "special" characters, how does it encode the regular ASCII characters?

In most of the charsets that support the character Á, its codepoint is 193 (0xC1). If you subtract 128 from that, you get 65 (0x41). Maybe your "codepage" is just the upper half of one of the standard charsets like ISO-8859-1 or windows-1252, with the high-order bit set to zero instead of one (that is, subtracting 128 from each one).

If that's the case, I would expect to find a flag you can set to tell it whether the next run of bytes should be converted using the "upper" or the "lower" half of the encoding. I don't know of any system that uses that scheme, but it's the most sensible explanation I can come up with for the situation you describe.
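To illustrate that guess, here is a minimal Python sketch. It assumes the device bytes really are the upper half of a standard charset with the high bit cleared, which is only a hypothesis; the helper name `decode_high_half` is made up for this example and has nothing to do with the device's own API.

```python
def decode_high_half(data, charset="iso-8859-1"):
    """Decode 7-bit device bytes by setting the high bit of each byte.

    Assumption (not confirmed by the device documentation): each byte
    is a code from the upper half of `charset`, minus 128.
    """
    if any(b > 0x7F for b in data):
        raise ValueError("expected only bytes in the range 0x00-0x7F")
    # Setting bit 7 is the same as adding 128 to a value below 128.
    return bytes(b | 0x80 for b in data).decode(charset)


# 0x41 | 0x80 == 0xC1, which is Á in ISO-8859-1
print(decode_high_half(b"\x41"))
# 0x00 | 0x80 == 0x80, which is the euro sign in windows-1252
print(decode_high_half(b"\x00", "windows-1252"))
```

The position of the euro sign in the device's table would be a useful check: windows-1252 has it at 0x80 and ISO-8859-15 at 0xA4, while ISO-8859-1 has no euro sign at all.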

score 1 · Accepted Answer

What endianness does the system use? Perhaps you're flipping the bit order?

answered 2009-01-06T14:19:14.217

score 1 · Accepted Answer

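Without additional information, there is no way to detect a code page automatically. Beneath the display layer it's all just bytes, and all bytes are created equal. A byte can't say "I'm 0x41 of such-and-such code page", only "I'm 0x41. Show me!"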

score 0 · Accepted Answer

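A bit of a random thought, but if you can copy a large amount of text off the device, you could try running it through something like the detect function in http://chardet.feedparser.org/.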

score 0 · Accepted Answer

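In most code pages 0x41 is just a plain "A", and I don't think any standard code page has "Á" at that position. The device may use a control character somewhere before the A to add the accent, or it may use a non-standard code page.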
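The claim about 0x41 can be checked by brute force. The following is a minimal sketch that treats the codecs bundled with Python as a stand-in for "standard code pages", which is an assumption, since Python does not ship every code page in existence. The function name `codecs_mapping` is invented for this example.

```python
import encodings.aliases


def codecs_mapping(byte_value, char):
    """Return the Python codec names that decode the single byte to `char`."""
    matches = []
    # The alias table maps many alias spellings to a smaller set of codec names.
    for name in sorted(set(encodings.aliases.aliases.values())):
        try:
            decoded = bytes([byte_value]).decode(name)
        except Exception:
            # Skip codecs that are not text encodings, are unavailable on
            # this platform, or cannot decode this byte on its own.
            continue
        if decoded == char:
            matches.append(name)
    return matches


if __name__ == "__main__":
    print(codecs_mapping(0x41, "A"))   # codecs with a plain A at 0x41
    print(codecs_mapping(0xC1, "Á"))   # codecs with Á at 0xC1
    print(codecs_mapping(0x41, "Á"))   # codecs with Á at 0x41, if any
```

If the last call prints an empty list, that supports the point that the device table is not a copy of any code page Python knows about, at least not without some transformation.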

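I don't see any use in knowing the "closest code page"; you just need to use the documentation that came with the device.

Your last sentence is puzzling: what do you mean by "look up a code page one is looking at"?

If you included the entire code page, people here could probably be more helpful and give you more insight into the problem. The single data point 0x41=Á doesn't help much.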
