php - How to get the correct match position in a multibyte string using preg_match

Question

I am currently matching HTML with the following code:

preg_match('/<\/?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;/u', $html, $match, PREG_OFFSET_CAPTURE, $position)

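It matches everything perfectly, but if the string contains a multibyte character, that character is counted as 2 characters in the returned position.

For example, the returned $match array looks something like this: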

array
  0 => 
    array
      0 => string '<br />' (length=6)
      1 => int 132
  1 => 
    array
      0 => string 'br' (length=2)
      1 => int 133

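The real position of the matched <br /> is 128, but there are 4 multibyte characters before it, so it reports 132. I really thought adding the /u modifier would let it know what was going on, but no luck.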

score 3 · Accepted Answer

I found this suggestion from @Qtax:

UTF-8 characters in preg_match_all (PHP)

For further reference, this bug surfaced while using this: Truncate text containing HTML, ignoring tags

The gist of the change is this:

$orig_utf = 'UTF-8';
$new_utf  = 'UTF-32';

mb_regex_encoding( $new_utf );

$html     = mb_convert_encoding( $html, $new_utf, $orig_utf );
$end_char = mb_convert_encoding( $end_char, $new_utf, $orig_utf );


mb_ereg_search_init( $html );

$pattern = '</?([a-z]+)[^>]*>|&#?[a-zA-Z0-9]+;';
$pattern = mb_convert_encoding( $pattern, $new_utf, $orig_utf );

while ( $printed < $limit && $tag_match = mb_ereg_search_pos( $pattern, $html ) ) {

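  // mb_ereg_search_pos() returns the offset and length in bytes;
  // every UTF-32 character is 4 bytes, hence the division by 4.
  $tag_position = $tag_match[0]/4;
  $tag_length   = $tag_match[1];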
  $tag          = mb_substr( $html, $tag_position, $tag_length/4, $new_utf );
  $tag_name     = preg_replace( '/[\s<>\/]+/', '', $tag );

  // Print text leading up to the tag.
  $str = mb_substr($html, $position, $tag_position - $position, $new_utf );

  .......

}

In addition, for truncating an HTML page, a few other changes are necessary:

$first_char = mb_substr( $tag, 0, 1, $new_utf );

if ( $first_char == mb_convert_encoding( '&', $new_utf ) ) {
  ...
}

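My text editor saves files as UTF-8, so comparing the UTF-32 character against a literal & from the source file would not work; the literal has to be converted to UTF-32 first.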
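
To illustrate why the conversion is needed, here is a minimal sketch. It assumes the mbstring extension is loaded and the source file is saved as UTF-8; the variable names are only for illustration.

```php
<?php
// Assumes ext/mbstring is loaded and this source file is saved as UTF-8.
$utf8_amp  = '&';                                         // 1 byte in UTF-8
$utf32_amp = mb_convert_encoding('&', 'UTF-32', 'UTF-8'); // 4 bytes in UTF-32

var_dump(strlen($utf8_amp));          // int(1)
var_dump(strlen($utf32_amp));         // int(4)
var_dump($utf32_amp === $utf8_amp);   // bool(false): different byte sequences

// The first character of a UTF-32 string matches only a UTF-32 '&'.
$tag        = mb_convert_encoding('&amp;', 'UTF-32', 'UTF-8');
$first_char = mb_substr($tag, 0, 1, 'UTF-32');
var_dump($first_char === $utf32_amp); // bool(true)
```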

score 3 · Accepted Answer

If you need a quick fix and don't care about speed:

$mb_pos = mb_strlen( substr($string, 0, $pos) );
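
The one-liner above could be wrapped in a small helper that converts every captured offset. This is only a sketch: it assumes the subject is valid UTF-8 and that mbstring is loaded, and the function name preg_match_char_offsets is hypothetical, not part of PHP.

```php
<?php
// Hypothetical helper: behaves like preg_match() with PREG_OFFSET_CAPTURE,
// but converts the byte offsets into character offsets.
// Assumes $subject is valid UTF-8 and ext/mbstring is available.
function preg_match_char_offsets(string $pattern, string $subject, int $byteOffset = 0): ?array
{
    if (preg_match($pattern, $subject, $match, PREG_OFFSET_CAPTURE, $byteOffset) !== 1) {
        return null; // no match, or a PCRE error
    }
    foreach ($match as $key => $group) {
        // Unmatched groups are reported with offset -1; leave those alone.
        if ($group[1] >= 0) {
            // Count the characters in the bytes that precede the match.
            $match[$key][1] = mb_strlen(substr($subject, 0, $group[1]), 'UTF-8');
        }
    }
    return $match;
}

$html  = 'héllo wörld<br />';
$match = preg_match_char_offsets('/<\/?([a-z]+)[^>]*>/u', $html);

// The byte offset of <br /> is 13 (é and ö take 2 bytes each),
// while the character offset is 11.
var_dump($match[0][1]); // int(11)
var_dump($match[1][1]); // int(12)
```

Note that the $byteOffset argument is still a byte offset, as in preg_match() itself. Each call re-scans the prefix of the string, which is why this approach is slow on long inputs.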

score 0 · Accepted Answer

Have you looked at http://www.php.net/manual/en/function.mb-ereg.php?

Answered 2012-03-30T22:03:09.447
