3

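I am reading a sequence of bytes from a stream. Assume, for the sake of argument, that the sequence is of fixed length and that I read the whole thing into a byte array (in my case it's a vector<char>, but that is not important for this question). This sequence of bytes contains a string, which may be either UTF-16 or UTF-8 encoded. Unfortunately, there is no indication of which one it is.

I can verify whether the sequence of bytes represents a valid UTF-16 encoding, and also whether it represents a valid UTF-8 encoding, but I can also imagine how the very same sequence of bytes could be valid UTF-8 and valid UTF-16 at the same time.

So, does this mean there is no way to determine generically which one it is?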


2 Answers

3

If the contents are expected to be written in a language using the Latin script, simply counting nulls will detect UTF-16, because UTF-16 encodes every ASCII-range character as two bytes, one of which is zero. In UTF-8, a null byte decodes to the NUL control character, which doesn't normally appear in text.

Text written in other scripts will not be fully valid as both UTF-16 and UTF-8 unless it has been artificially constructed to be so.

So, first detect whether it's a fully valid UTF-8 sequence on its own:

  • If yes, check for null bytes, and if there are some, it's UTF-16. Otherwise it's UTF-8.
  • If not, it's UTF-16.
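
The two steps above could be sketched roughly as follows. This is only a minimal illustration of the heuristic, not a robust detector: the names `Guess`, `is_valid_utf8` and `guess_encoding` are made up for this example, and it assumes the whole input is already in a `std::vector<char>` and carries no BOM.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
enum class Guess { Utf8, Utf16 };

// Returns true if `bytes` is a well-formed UTF-8 sequence. Rejects
// truncated sequences, overlong forms, surrogates and values > U+10FFFF.
bool is_valid_utf8(const std::vector<char>& bytes) {
    const std::size_t n = bytes.size();
    std::size_t i = 0;
    while (i < n) {
        const unsigned char b0 = static_cast<unsigned char>(bytes[i]);
        std::size_t extra = 0;     // continuation bytes expected
        unsigned long cp = 0;      // code point being assembled
        unsigned long min_cp = 0;  // smallest value allowed for this length
        if (b0 < 0x80)                { extra = 0; cp = b0;        min_cp = 0; }
        else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; min_cp = 0x80; }
        else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; min_cp = 0x800; }
        else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; min_cp = 0x10000; }
        else return false;         // stray continuation or invalid lead byte

        if (n - i - 1 < extra) return false;  // sequence is cut off
        for (std::size_t k = 1; k <= extra; ++k) {
            const unsigned char b = static_cast<unsigned char>(bytes[i + k]);
            if ((b & 0xC0) != 0x80) return false;  // not a continuation byte
            cp = (cp << 6) | (b & 0x3F);
        }
        if (cp < min_cp) return false;                   // overlong encoding
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;  // surrogate
        if (cp > 0x10FFFF) return false;                 // out of Unicode range
        i += extra + 1;
    }
    return true;
}

// Applies the heuristic: invalid UTF-8 means UTF-16; valid UTF-8 that
// contains a null byte is also treated as UTF-16; otherwise UTF-8.
Guess guess_encoding(const std::vector<char>& bytes) {
    if (!is_valid_utf8(bytes)) return Guess::Utf16;
    for (char c : bytes) {
        if (c == '\0') return Guess::Utf16;
    }
    return Guess::Utf8;
}
```

Note that plain ASCII bytes of even length are valid in both encodings, so for those inputs the result rests entirely on the null check.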

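If the above resulted in UTF-16, that's not enough, as you have to know the endianness as well. With languages written in the Latin script, the position of the null bytes will tell you this: in big-endian UTF-16 the null high byte comes first (even offsets), while in little-endian it comes second (odd offsets).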
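
A rough sketch of that byte-order check, under the same assumption of mostly Latin-script text. The names `ByteOrder` and `guess_utf16_byte_order` are invented for this example, and a tie (including no nulls at all) is reported as unknown rather than guessed.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical result type for this sketch.
enum class ByteOrder { Little, Big, Unknown };

// Counts null bytes at even and odd offsets. For Latin-script text the
// high byte of each UTF-16 code unit is zero, so nulls cluster at even
// offsets in big-endian data and at odd offsets in little-endian data.
ByteOrder guess_utf16_byte_order(const std::vector<char>& bytes) {
    std::size_t even_nulls = 0;
    std::size_t odd_nulls = 0;
    for (std::size_t i = 0; i < bytes.size(); ++i) {
        if (bytes[i] == '\0') {
            if (i % 2 == 0) ++even_nulls;
            else ++odd_nulls;
        }
    }
    if (odd_nulls > even_nulls) return ByteOrder::Little;
    if (even_nulls > odd_nulls) return ByteOrder::Big;
    return ByteOrder::Unknown;  // no nulls, or an even split
}
```

For example, the bytes `'a', 0, 'b', 0` come out as little-endian and `0, 'a', 0, 'b'` as big-endian.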

answered 2013-01-07T13:37:05.480
2
answered 2013-01-07T13:05:10.640