php - Can PHP detect 4-byte encoded utf8 chars?

Question

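I am using a utf8 charset MySQL table on a MySQL 5.1 server, which does not support the utf8mb4 encoding for tables. When inserting 4-byte encoded utf8 chars, e.g. "","","","","","唧","", the table either raises an error or skips the text that follows.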

How can I programmatically detect 4-byte encoded utf8 chars in PHP and replace them?

score 18 · Accepted Answer

This should work:

if (max(array_map('ord', str_split($string))) >= 240)

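The reasoning is that code points up to and including U+FFFF are encoded in at most three bytes, the longest having the form 1110xxxx 10xxxxxx 10xxxxxx. Higher code points have the form 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx, i.e. the leading byte has a value of 240 (0xF0) or higher. If the string contains any such byte, that indicates a 4-byte sequence.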
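The lead-byte rule can also be checked without building an array of byte values. The following is only a sketch: the helper name hasFourByteChar is made up for illustration, and it assumes the input is already valid UTF-8, where the bytes 0xF0 to 0xF4 can only appear as the first byte of a 4-byte sequence.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Assumes $string is valid UTF-8; in that case the bytes 0xF0..0xF4
// only ever occur as the lead byte of a 4-byte sequence.
function hasFourByteChar(string $string): bool
{
    // No /u modifier: the pattern is matched byte by byte.
    return preg_match('/[\xF0-\xF4]/', $string) === 1;
}
```

For example, hasFourByteChar("\xF0\x9F\x98\x80") should report a 4-byte character, while a string that only contains 3-byte characters such as "唧" should not.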

If you want to remove the long characters, you can do this:

$string = preg_replace_callback('/./u', function (array $match) {
    // Drop any character whose UTF-8 encoding is 4 bytes long; keep the rest.
    return strlen($match[0]) >= 4 ? '' : $match[0];
}, $string);

There may, however, be a more elegant regex way to express the high code points directly.
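One way to express the high code points directly is a character class over U+10000 to U+10FFFF. This is a sketch rather than part of the original answer: the function name stripSupplementary is hypothetical, and it assumes PCRE was built with UTF-8 support, since it relies on the /u modifier.

```php
<?php
// Hypothetical helper, not part of the original answer.
// Requires PCRE with UTF-8 support because of the /u modifier.
function stripSupplementary(string $string, string $replacement = ''): ?string
{
    // U+10000..U+10FFFF are exactly the code points that need 4 bytes in UTF-8.
    // preg_replace() returns null on failure, e.g. if $string is not valid UTF-8.
    return preg_replace('/[\x{10000}-\x{10FFFF}]/u', $replacement, $string);
}
```

Passing a replacement such as '?' keeps a visible placeholder where each long character was, instead of silently dropping it.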

score 18 · Accepted Answer

The following regex will replace 4-byte UTF-8 characters:

function replace4byte($string, $replacement = '') {
    return preg_replace('%(?:
          \xF0[\x90-\xBF][\x80-\xBF]{2}      # planes 1-3
        | [\xF1-\xF3][\x80-\xBF]{3}          # planes 4-15
        | \xF4[\x80-\x8F][\x80-\xBF]{2}      # plane 16
    )%xs', $replacement, $string);    
}

var_dump(replace4byte('d'), replace4byte('dd'));

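This does not rely on the /u modifier, so you don't need to worry about whether PCRE was compiled with UTF-8 support. If you do have that support, however, deceze's preg_replace_callback approach is neater.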

(Regex adapted from "Ensuring valid utf-8 in PHP")
