php - Understanding the results of PHP's mb_detect_encoding and mb_check_encoding functions

Question

I'm trying to understand the logic of the two functions mb_detect_encoding and mb_check_encoding, but the documentation is poor. Starting with a very simple test string

$string = "\x65\x92";

In Windows-1252 encoding, this is a lowercase "e" (0x65) followed by a curly quote mark (0x92).

I get the following results:

mb_detect_encoding($string,"Windows-1252"); // false
mb_check_encoding($string,"Windows-1252"); // true
mb_detect_encoding($string,"ISO-8859-1"); // ISO-8859-1
mb_check_encoding($string,"ISO-8859-1"); // true
mb_detect_encoding($string,"UTF-8",true); // false
mb_detect_encoding($string,"UTF-8"); // UTF-8
mb_check_encoding($string,"UTF-8"); // false

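I don't understand why mb_detect_encoding reports ISO-8859-1 for the string but not Windows-1252 when, according to https://en.wikipedia.org/wiki/ISO/IEC_8859-1 and https://en.wikipedia.org/wiki/Windows-1252, the byte x92 is defined in the Windows-1252 character encoding but not in ISO-8859-1.
Secondly, I don't understand how mb_detect_encoding can return false while mb_check_encoding returns true for the same string and the same character encoding.
Lastly, I don't understand why the string is detected as UTF-8 at all, strict mode or not. The byte x92 is a continuation byte in UTF-8, but in this string it follows a complete single-byte character, not the leading byte of a multi-byte sequence.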

score 1 · Accepted Answer

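Your example does a good job of showing why mb_detect_encoding should be used sparingly: it is not intuitive and is sometimes logically wrong. If you must use it, always pass strict = true as the third argument, so that non-UTF-8 strings are not reported as UTF-8.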
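As a rough illustration of the difference, the sketch below runs the question's test string through both modes and through a direct validity check. The return values noted in the comments for mb_detect_encoding are the ones reported in the question; they may vary between PHP versions, so treat them as observations rather than guarantees.

```php
<?php
// The test string from the question: byte 0x65 followed by byte 0x92.
$string = "\x65\x92";

// Non-strict detection; the question reports 'UTF-8' here.
$loose = mb_detect_encoding($string, 'UTF-8');

// Strict detection; the question reports false here.
$strict = mb_detect_encoding($string, 'UTF-8', true);

// Direct validity check; 0x92 is a continuation byte with no
// leading byte before it, so the string is not valid UTF-8.
$valid = mb_check_encoding($string, 'UTF-8');

var_dump($loose, $strict, $valid);
```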

It's a bit more reliable to run mb_check_encoding over a list of the encodings you expect, in order of likelihood/priority. For example:

$encodings = [
    'UTF-8',
    'Windows-1252',
    'SJIS',
    'ISO-8859-1',
];

$encoding = 'UTF-8'; // default, kept if no candidate validates
$string = 'foo';
foreach ($encodings as $candidate) {
    if (mb_check_encoding($string, $candidate)) {
        // We'll assume the encoding is $candidate since the string is valid in it
        $encoding = $candidate;
        break;
    }
}

The ordering depends on your priorities, though.
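A minimal sketch of how this check-in-priority-order idea might be packaged and followed by a conversion to UTF-8. The helper name firstValidEncoding and the shortened candidate list are assumptions made for illustration, not part of the original answer; returning null when nothing matches is likewise just one possible design.

```php
<?php
/**
 * Hypothetical helper: returns the first candidate encoding in which
 * $string is valid, or null when none of the candidates match.
 */
function firstValidEncoding(string $string, array $candidates): ?string
{
    foreach ($candidates as $candidate) {
        if (mb_check_encoding($string, $candidate)) {
            return $candidate;
        }
    }
    return null;
}

// The test string from the question.
$string = "\x65\x92";
$encoding = firstValidEncoding($string, ['UTF-8', 'Windows-1252', 'ISO-8859-1']);

// Convert only when a candidate matched and it is not already UTF-8.
// If $encoding is null the string is left untouched and the caller
// still has to decide how to handle it.
$utf8 = ($encoding !== null && $encoding !== 'UTF-8')
    ? mb_convert_encoding($string, 'UTF-8', $encoding)
    : $string;
```

Because validity is all that is tested, a permissive single-byte encoding placed early in the list will claim most inputs, which is why the order of candidates matters.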
