php - How to get the correct position of a word in UTF-8 text?

Question

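I have some simple PHP code that extracts a sentence from a text and puts a specific word in bold.

First, I build an array containing the words I want and their positions in the text.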

$all_words = str_word_count($text, 2, 'åæéø');

// $words is an array with the words that I want to find.
$words_found = array();
foreach ($all_words as $pos => $word_found) {
  foreach ($words as $word) {
    if ($word == strtolower($word_found)) {
      $words_found[$pos] = $word_found;
      break;
    }
  }
}

Then, for each word in $words_found, I extract a portion of the text with the word in the middle.

$length = 90;
foreach ($words_found as $offset => $word) {
  $word_length = strlen($word);

  $start = $offset - $length;
  $last_start = $start + $length + $word_length;

  $first_part = substr($text, $start, $length);
  $last_part = substr($text, $last_start, $length);

  $sentence = $first_part . '<b>' . $word . '</b>' . $last_part;
}

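It works fine, except when the text is UTF-8 and contains Danish characters (åæéø). In that case, when $first_part or $last_part starts with a Unicode character, the string returned by substr is empty.

I know about the mb_substr function, so I replaced my code with it.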

$word_length = mb_strlen($word, 'UTF-8');
$first_part = mb_substr($text, $start, $length, 'UTF-8');
$last_part = mb_substr($text, $last_start, $length, 'UTF-8');

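But with this function (mb_substr), the position of the word ($offset) is wrong, and the new substring ($sentence) does not match what it should be.

Is there something like mb_str_word_count? How can I get the correct position of the word?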
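A likely cause of the mismatch is that str_word_count reports positions in bytes, while mb_substr counts characters, and every Danish letter before the word adds one extra byte in UTF-8. The following is a minimal sketch of converting one into the other, assuming the mbstring extension is loaded and the source file is saved as UTF-8. The helper name byte_to_char_offset is hypothetical, not part of PHP.

```php
<?php
// Hypothetical helper: convert a byte offset into a character offset.
function byte_to_char_offset(string $text, int $byteOffset): int
{
    // Count the UTF-8 characters that come before the byte offset.
    return mb_strlen(substr($text, 0, $byteOffset), 'UTF-8');
}

$text = 'blå bil';                       // 'å' occupies two bytes in UTF-8
$byteOffset = strpos($text, 'bil');      // byte offset: 5
$charOffset = byte_to_char_offset($text, $byteOffset); // character offset: 4

echo mb_substr($text, $charOffset, 3, 'UTF-8'); // prints "bil"
```

With the converted offset, mb_substr and mb_strlen can be used consistently for the rest of the snippet arithmetic.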

score 2 · Accepted Answer

Try using a regular expression with word boundaries.

$string = 'That this notpink a or pink blue red dark.';
$regex = '/\bpink\b/';
preg_match($regex, $string, $match, PREG_OFFSET_CAPTURE);
$pos = $match[0][1];
echo $pos;
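For UTF-8 text with several search words, the same idea could be extended roughly as follows. This is only a sketch under some assumptions: it uses the /u modifier with \p{L} and \p{N} lookarounds in place of \b so that letters such as å count as word characters, and find_word_offsets is a hypothetical helper name. Note that PREG_OFFSET_CAPTURE still returns byte offsets, even with /u.

```php
<?php
// Hypothetical helper: find whole-word matches, returned as [byteOffset => matchedWord].
function find_word_offsets(string $text, array $words): array
{
    // Escape each word so it is matched literally inside the pattern.
    $quoted = array_map(function ($w) {
        return preg_quote($w, '/');
    }, $words);

    // The lookarounds act as Unicode-aware word boundaries; "i" ignores case.
    $regex = '/(?<![\p{L}\p{N}_])(?:' . implode('|', $quoted) . ')(?![\p{L}\p{N}_])/ui';
    preg_match_all($regex, $text, $matches, PREG_OFFSET_CAPTURE);

    $found = array();
    foreach ($matches[0] as $match) {
        // $match[0] is the matched text, $match[1] its byte offset.
        $found[$match[1]] = $match[0];
    }
    return $found;
}

$found = find_word_offsets('En blå bil og en Blå båd.', array('blå'));
// $found is array(3 => 'blå', 18 => 'Blå')
```

Because the keys are byte offsets, they pair with substr and strlen, not with mb_substr.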

Edit:

If you don't like regular expressions, you can use stripos and match the word together with its surrounding spaces.

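// Pad both ends with a space so a word at the very start or end also matches.
// The space before the word in the padded string sits at the word's own
// (byte) offset in $string, so no adjustment is needed.
// Words followed by punctuation (e.g. "pink.") are not matched.
$pos = stripos(' ' . $string . ' ', ' pink '); // false if not found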

score 1 · Accepted Answer

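I tried @Mario Johnathan's solution, but it didn't work properly for me.

In the end I came up with my own solution: I keep using the non-multibyte functions such as substr with the positions given by str_word_count, and if the first substring starts with a Danish character, I adjust where that substring begins.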

$first_part_aux = str_split(trim($first_part));

if (!ctype_alpha($first_part_aux[0])) {
  for ($i = 1; $i < count($first_part_aux); $i++) {
    if (ctype_alpha($first_part_aux[$i])) {
      $start = $start + $i;
      $length = $length - $i;

      $first_part = substr($text, $start, $length);

      break;
    }
  }
}
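A more general variant of the same idea, still working only with byte offsets, could look like the sketch below. Instead of testing with ctype_alpha, it checks for UTF-8 continuation bytes (bit pattern 10xxxxxx) and moves both cut points onto character boundaries. The helper names is_continuation_byte and byte_snippet are hypothetical, and the text is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper: true when the byte at $pos is a UTF-8 continuation byte (10xxxxxx).
function is_continuation_byte(string $text, int $pos): bool
{
    return (ord($text[$pos]) & 0xC0) === 0x80;
}

// Hypothetical helper: build a highlighted snippet using byte offsets only.
function byte_snippet(string $text, int $offset, string $word, int $length = 90): string
{
    $total = strlen($text);
    $wordEnd = $offset + strlen($word);

    // Move the start forward until it sits on the first byte of a character.
    $start = max(0, $offset - $length);
    while ($start < $offset && is_continuation_byte($text, $start)) {
        $start++;
    }

    // Move the end backward until the cut no longer splits a character.
    $end = min($total, $wordEnd + $length);
    while ($end > $wordEnd && $end < $total && is_continuation_byte($text, $end)) {
        $end--;
    }

    return substr($text, $start, $offset - $start)
        . '<b>' . $word . '</b>'
        . substr($text, $wordEnd, $end - $wordEnd);
}

echo byte_snippet('En blå bil og en blå båd.', 18, 'blå', 3); // prints "en <b>blå</b> b"
```

Here $length is measured in bytes, so a snippet containing many non-ASCII letters will show slightly fewer characters than one made of plain ASCII.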
