php - How can I detect which Chinese encoding a text file uses?

Question

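About 20 Chinese encodings are listed at http://www.gnu.org/software/libiconv/:

Chinese: EUC-CN, HZ, GBK, CP936, GB18030, EUC-TW, BIG5, CP950, BIG5-HKSCS, BIG5-HKSCS:2004, BIG5-HKSCS:2001, BIG5-HKSCS:1999, ISO-2022-CN, ISO-2022-CN-EXT

So I have a text file that is not UTF-8. It is reported as ASCII. I would like to convert it to UTF-8 using iconv(), but for that I need to know the character encoding of the source.

How can I do that if I don't know Chinese? :(

I noticed that: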

$str = iconv('GB18030', 'UTF-8', $str);
file_put_contents('file.txt', $str);

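produces a UTF-8 encoded file, while the other encodings I tried (CP950, GBK and EUC-CN) produce an ASCII file. Does this mean that iconv can detect when the input encoding is wrong for the given string?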

score 3 · Accepted Answer

This might do what you need (but I really can't tell). Setting the locale, applying utf8_decode, and using mb_check_encoding instead of mb_detect_encoding seems to give some useful output.

// some text from http://chinesenotes.com/chinese_text_l10n.php
// have tried both as string and content loaded from a file
$chinese = '譧躆 礛簼繰 剆坲姏 潧 騔鯬 跠 瘱瘵瘲 忁曨曣 蛃袚觙';
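// note: utf8_decode() converts UTF-8 to ISO-8859-1; characters without a Latin-1 equivalent become '?'
$chinese = utf8_decode($chinese);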

$chinese_encodings ='EUC-CN,HZ,GBK,CP936,GB18030,EUC-TW,BIG5,CP950,BIG5-HKSCS,BIG5-HKSCS:2004,BIG5-HKSCS:2001,BIG5-HKSCS:1999,ISO-2022-CN,ISO-2022-CN-EXT';

$encodings = explode(',',$chinese_encodings);

//set chinese locale
setlocale(LC_CTYPE, 'Chinese');

foreach($encodings as $encoding) {
    if (@mb_check_encoding($chinese, $encoding)) {
        echo 'The string seems to be compatible with '.$encoding.'<br>';
    } else {
        echo 'Not compatible with '.$encoding.'<br>';
    }
}

Output

The string seems to be compatible with EUC-CN
The string seems to be compatible with HZ
The string seems to be compatible with GBK
The string seems to be compatible with CP936
Not compatible with GB18030
The string seems to be compatible with EUC-TW
The string seems to be compatible with BIG5
The string seems to be compatible with CP950
Not compatible with BIG5-HKSCS
Not compatible with BIG5-HKSCS:2004
Not compatible with BIG5-HKSCS:2001
Not compatible with BIG5-HKSCS:1999
Not compatible with ISO-2022-CN
Not compatible with ISO-2022-CN-EXT

This is a complete guess. For now it at least seems to recognize some of the Chinese encodings. Please delete it if it is total rubbish.

score 2 · Accepted Answer

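Detecting the encoding is hard because the same octet sequence decodes to valid characters in several encodings, but the result only makes sense in the correct one. In these cases, what I have done is take the decoded text to an automatic translation service and see whether it comes back as legible text or as a jumble of syllables.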
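To see that ambiguity for yourself, here is a minimal sketch that tries each candidate with iconv() and keeps the ones that convert without an error. The function name plausibleEncodings is made up for this example, and it assumes the iconv extension is loaded. It relies only on iconv() returning false on failure.

```php
<?php
// Hypothetical helper (not part of PHP): returns every candidate encoding
// under which $bytes converts to UTF-8 without iconv() failing, mapped to
// the converted text.
function plausibleEncodings(string $bytes, array $candidates): array
{
    $plausible = [];
    foreach ($candidates as $encoding) {
        // iconv() returns false on an illegal sequence or an unknown
        // charset name; @ hides the accompanying notice/warning.
        $converted = @iconv($encoding, 'UTF-8', $bytes);
        if ($converted !== false) {
            $plausible[$encoding] = $converted;
        }
    }
    return $plausible;
}

// Usage: these four bytes are two characters in GBK.
$result = plausibleEncodings("\xD6\xD0\xCE\xC4", ['GBK', 'BIG5', 'EUC-CN']);
foreach ($result as $encoding => $text) {
    echo $encoding, ' => ', $text, "\n";
}
```

More than one candidate will usually survive, so this only narrows the list; something (or someone) still has to judge which decoded text is meaningful.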

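You could do this programmatically, for example by analysing the trigram frequencies in the input text. Libraries have been written to solve this problem, and there are external programs that do it, but I have not seen anything with a PHP API. This approach is not foolproof either.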
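As a rough illustration of the frequency idea, the sketch below scores each decoded candidate by how many of its non-ASCII characters are very common Chinese characters. Note that this is a much simpler single-character heuristic, not real trigram analysis. The function names and the short character list are assumptions made up for this example, not a real frequency table, and the iconv extension is assumed to be available.

```php
<?php
// Hypothetical scorer: fraction of the non-ASCII characters in a UTF-8
// string that appear in a small hand-picked set of common characters.
function commonCharRatio(string $utf8): float
{
    // Assumption: illustrative sample only (simplified and traditional forms).
    $common = '的一是不了人我在有他中大上到子和你地出道也年这這来來国國个個说說们們为為时時';
    $commonSet = array_flip(preg_split('//u', $common, -1, PREG_SPLIT_NO_EMPTY));
    $chars = preg_split('//u', $utf8, -1, PREG_SPLIT_NO_EMPTY);
    if ($chars === false) {
        return 0.0; // input was not valid UTF-8
    }
    $total = 0;
    $hits = 0;
    foreach ($chars as $ch) {
        if (strlen($ch) === 1) {
            continue; // ASCII decodes the same way everywhere, so skip it
        }
        $total++;
        if (isset($commonSet[$ch])) {
            $hits++;
        }
    }
    return $total === 0 ? 0.0 : $hits / $total;
}

// Hypothetical helper: decodes $bytes with every candidate and returns the
// candidate whose output scores highest, or null if none converts.
function guessEncoding(string $bytes, array $candidates): ?string
{
    $best = null;
    $bestScore = -1.0;
    foreach ($candidates as $encoding) {
        $decoded = @iconv($encoding, 'UTF-8', $bytes);
        if ($decoded === false) {
            continue; // illegal sequence or unknown charset
        }
        $score = commonCharRatio($decoded);
        if ($score > $bestScore) {
            $best = $encoding;
            $bestScore = $score;
        }
    }
    return $best;
}
```

With short inputs or ties the result is little better than a guess, so treat the returned name as a hint to verify rather than an answer.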

score 2 · Accepted Answer

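I have zero experience with Chinese encodings, and I know this question is tagged iconv, but if it gets the job done you could try mb_detect_encoding to detect your encoding. Its second argument is the list of encodings to check, and there is a user-contributed note about Chinese encodings:

For Chinese developers: please note that the second argument of this function does not include 'GB2312' and 'GBK', and the return value is 'EUC-CN' when a GB2312 string is detected.

So perhaps it would work if you explicitly passed the full list of Chinese encodings as the second argument? It could work like this: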

$encoding = mb_detect_encoding($str, 'GB2312,GBK,(...)');
if ($encoding !== false) $utf8text = iconv($encoding, 'UTF-8', $str);

You may also want to use the third argument (strict).
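For what it's worth, a minimal sketch with strict enabled could look like the following. The function name toUtf8 is made up for this example, and it assumes the mbstring extension is loaded. The encoding names are ones mbstring itself lists (CP936 for GBK, BIG-5 for Big5); which names are accepted may depend on your PHP version, so check mb_list_encodings() first.

```php
<?php
// Hypothetical helper: detects the encoding among $candidates in strict
// mode and converts to UTF-8, or returns null when nothing matches.
function toUtf8(string $bytes, array $candidates): ?string
{
    // Third argument true = strict: a candidate is only accepted if the
    // whole string is valid in it.
    $encoding = mb_detect_encoding($bytes, $candidates, true);
    if ($encoding === false) {
        return null;
    }
    return mb_convert_encoding($bytes, 'UTF-8', $encoding);
}

// Usage: 'file.txt' is the file name from the question.
// $utf8 = toUtf8(file_get_contents('file.txt'), ['UTF-8', 'CP936', 'BIG-5']);
```

The order of the candidate list can matter, and a successful detection still only means the bytes are valid in that encoding, not that it is the right one.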
