php - preg_match does not find a UTF-8 character at the start of a binary string that contains non-UTF-8 bytes

Question

If a string contains a non-UTF-8 byte anywhere, preg_match with the u modifier returns false to signal an error. For example:

<?php
$string = "ABCD\xc3";
$r = preg_match('/^./u',$string, $match);
var_dump($r);  //bool(false)

Try this example yourself: https://3v4l.org/qkHl4
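To confirm that the false really comes from UTF-8 validation and not from the pattern, preg_last_error() can be checked right after the call. A minimal sketch; the variable name $isUtf8Error is only illustrative:

```php
<?php
$string = "ABCD\xc3";                // valid ASCII followed by a lone lead byte
$r = preg_match('/^./u', $string, $match);

var_dump($r);                        // bool(false)

// PREG_BAD_UTF8_ERROR means the subject is not valid UTF-8
$isUtf8Error = (preg_last_error() === PREG_BAD_UTF8_ERROR);
var_dump($isUtf8Error);              // bool(true)
```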

If the non-UTF-8 byte at the end is removed, the regular expression finds the first character.

$string = "ABCD";
$r = preg_match('/^./u',$string, $match);
var_dump($r, $match); 
//int(1) array(1) { [0]=> string(1) "A" }

Is there a simple way to use a regular expression to match a UTF-8 character at the start of a string that also contains non-UTF-8 bytes?

score 0 · Accepted Answer

According to this answer, you can use mb_convert_encoding to sanitize the string first, so that it no longer contains invalid UTF-8 bytes:

$string = "ABCD\xc3";
$string = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
$r = preg_match('/^./u', $string, $match);
var_dump($r, $match);

This gives the following result:

int(1)
array(1) {
  [0] =>
  string(1) "A"
}
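Keep in mind that the converted string is no longer the original binary string: the invalid byte is dropped or replaced by the mbstring substitute character, depending on configuration. A small sketch for checking validity before and after the conversion, assuming the mbstring extension is loaded ($clean is just an illustrative name):

```php
<?php
$string = "ABCD\xc3";
$validBefore = mb_check_encoding($string, 'UTF-8');
var_dump($validBefore);              // bool(false)

$clean = mb_convert_encoding($string, 'UTF-8', 'UTF-8');
$validAfter = mb_check_encoding($clean, 'UTF-8');
var_dump($validAfter);               // bool(true)

// Inspect the raw bytes to see what happened to \xc3
var_dump(bin2hex($clean));
```

If the original bytes must be preserved, this approach is not suitable, because the invalid byte cannot be recovered from the converted string.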

score 0 · Accepted Answer

You could also consider T-Regx, which reports UTF-8 errors as exceptions instead of a false return value:

try {
    pattern('^.', 'u')->match("ABCD\xc3")->all();
} catch (SafeRegexException $e) {
    // handle the error
}

score 0 · Accepted Answer

After a long search, I think I have found the answer myself.

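The u modifier only works if the entire string is valid UTF-8. Even when only the first character is being matched, the whole string is validated first, so the u modifier cannot be used for this problem. A regular expression without the u modifier can still do the job.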
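Simply dropping the u modifier is not enough, though. In byte mode the dot matches a single byte, so a multi-byte character gets cut apart. A short sketch of the difference (variable names are illustrative):

```php
<?php
// "€" (bytes E2 82 AC), then "a", then an invalid byte, then "def"
$string = "\xE2\x82\xAC" . "a\xc3def";

$withU = preg_match('/^./u', $string, $m);
var_dump($withU);                    // bool(false): whole subject is validated

preg_match('/^./', $string, $m);
$firstByte = bin2hex($m[0]);
var_dump($firstByte);                // string(2) "e2": only the first byte of "€"
```

That is why the pattern that follows spells out the UTF-8 byte ranges explicitly.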

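function utf8Char($string){
    // No u modifier: the pattern works on raw bytes, so invalid bytes
    // later in the string do not make preg_match fail.
    // Structural check only: a few invalid encodings
    // (e.g. overlong 3-byte forms) still pass.
    $ok = preg_match(
      '/^(?:
        [\xF0-\xF4][\x80-\xBF]{3}   # 4-byte sequence
      | [\xE0-\xEF][\x80-\xBF]{2}   # 3-byte sequence
      | [\xC2-\xDF][\x80-\xBF]      # 2-byte sequence
      | [\x00-\x7F]                 # single ASCII byte
      )/x',
      $string,
      $match);
    // Return the leading character, or false if the string starts with an invalid byte
    return $ok ? $match[0] : false;
}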


var_dump(utf8Char("€a\xc3def"));  //string(3) "€"
var_dump(utf8Char("a\xc3def"));  //string(1) "a"
var_dump(utf8Char("\xc3def"));  //bool(false)

The non-UTF-8 byte can be retrieved with the substr function.

var_dump(substr("\xc3def",0,1)); //string(1) "�"
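Putting the two pieces together, a whole binary string can be walked from left to right, taking a UTF-8 character where one is found and a single raw byte otherwise. The following is only a sketch under that assumption; leadingUtf8Char() and splitCharsAndBytes() are hypothetical helper names, and the pattern is a byte-level check in the spirit of utf8Char() above, repeated so the example is self-contained:

```php
<?php
// Returns the UTF-8 character at the start of $s, or false if $s starts with an invalid byte.
function leadingUtf8Char(string $s) {
    $ok = preg_match(
        '/^(?:[\xF0-\xF4][\x80-\xBF]{3}|[\xE0-\xEF][\x80-\xBF]{2}|[\xC2-\xDF][\x80-\xBF]|[\x00-\x7F])/',
        $s,
        $m
    );
    return $ok ? $m[0] : false;
}

// Splits a binary string into valid characters and stray bytes (shown as hex).
function splitCharsAndBytes(string $s): array {
    $parts = [];
    $offset = 0;
    $len = strlen($s);
    while ($offset < $len) {
        $char = leadingUtf8Char(substr($s, $offset));
        if ($char === false) {
            // Not the start of a well-formed sequence: consume exactly one byte
            $parts[] = ['type' => 'byte', 'hex' => bin2hex($s[$offset])];
            $offset++;
        } else {
            $parts[] = ['type' => 'char', 'value' => $char];
            $offset += strlen($char);
        }
    }
    return $parts;
}

// "€a", one invalid byte, then "def"
$parts = splitCharsAndBytes("\xE2\x82\xAC" . "a\xc3def");
var_dump(count($parts));   // int(6)
var_dump($parts[2]);       // the stray byte, reported as hex "c3"
```

Calling substr() on every step keeps the sketch short; for long inputs, passing an offset to preg_match with a \G anchor would avoid the repeated copying.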
