php - preg_match_all removes Latin letters

Question

I have a problem with Latin characters. Here is the code:

$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www', 'on', 'ona', 'ja');

$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string

$string = preg_replace('/[^a-zA-Z0-9žšđčćŽŠĐČĆ -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…

$string = mb_strtolower($string); // make it lowercase

preg_match_all('/\b.*?\b/i', $string, $matchWords);

$matchWords = $matchWords[0];

foreach ( $matchWords as $key=>$item ) {
    if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
        unset($matchWords[$key]);
    }
}

$wordCountArr = array();
if ( is_array($matchWords) ) {
    foreach ( $matchWords as $key => $val ) {
        $val = strtolower($val);
        if ( isset($wordCountArr[$val]) ) {
            $wordCountArr[$val]++;
        } else {
            $wordCountArr[$val] = 1;
        }
    }
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;

When I return $matchWords[0] from this line of the code:

preg_match_all('/\b.*?\b/i', $string, $matchWords);

I get this string (the array imploded with spaces):

ti si mi znaj na srcu kvar znaj znaj znaj srcu ž urka

There is a space inside "ž urka", which should be the single word "žurka".

score 2 · Accepted Answer

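From the docs: A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.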
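To make that definition concrete, here is a small sketch that lists the offsets where \b matches in a plain ASCII string. The helper name wordBoundaryOffsets is made up for this illustration and is not part of PHP.

```php
<?php
// Hypothetical helper (not a PHP built-in): returns the byte offsets
// at which the zero-width assertion \b matches in $subject.
function wordBoundaryOffsets(string $subject, string $modifiers = ''): array
{
    // Every match of \b is an empty string, so only the offsets are of interest.
    preg_match_all('/\b/' . $modifiers, $subject, $matches, PREG_OFFSET_CAPTURE);

    // Each entry of $matches[0] is [matchedText, offset]; keep the offset.
    return array_column($matches[0], 1);
}

// "ab cd": a boundary sits at the start and at the end of each word.
print_r(wordBoundaryOffsets('ab cd')); // offsets 0, 2, 3, 5
```

Offsets 0 and 5 are the string start and end (both next to a \w character), while 2 and 3 are the two sides of the space.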

ž (including the space in front of it) matches \W, but u matches \w, so you get ž and urka as separate matches.

If these characters are at the end of the string, they will not be matched by the pattern:

 žšđčć ŽŠĐČĆ :)

...they are all \W characters and would need to be followed by a \w character for the pattern to match (because of the second \b).
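A quick way to see the effect is the sketch below. It uses the simpler pattern \w+ instead of the one from the question, and it assumes the script is saved as UTF-8 and that PHP runs with its default "C" locale (with a single-byte locale set via setlocale the byte-mode result may differ).

```php
<?php
// Assumption: this file is UTF-8 encoded, so "ž" is the two bytes 0xC5 0xBE.
$text = 'srcu žurka';

// Byte mode: neither byte of "ž" counts as \w, so the word starts at "u".
preg_match_all('/\w+/', $text, $byteMode);

// UTF-8 mode: "ž" is decoded as one letter and counts as \w.
preg_match_all('/\w+/u', $text, $utf8Mode);

print_r($byteMode[0]); // srcu, urka
print_r($utf8Mode[0]); // srcu, žurka
```

In byte mode the "ž" ends up on the \W side together with the space, which is where the stray "ž" token in the question comes from.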

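I guess you are looking for the u modifier, which makes PCRE treat the pattern and the subject string as UTF-8. Try: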

preg_match_all('/\b.*?\b/iu', $string, $matchWords);
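Note that the rest of the question's code still uses strlen() and strtolower(), which work on bytes rather than characters. Below is a rough sketch of a UTF-8-aware variant of the word count. The function name topWords and its parameters are invented for this example, it assumes UTF-8 input and the mbstring extension, and it matches words with \w+ rather than the original \b.*?\b, so it is not a drop-in replacement.

```php
<?php
// Hypothetical helper (name and signature are illustrative only).
// Assumes UTF-8 input and that the mbstring extension is loaded.
function topWords(string $text, array $stopWords, int $limit = 10): array
{
    // Lowercase with multibyte support so "Ž" becomes "ž".
    $text = mb_strtolower($text, 'UTF-8');

    // With /u, \w also matches letters such as ž, š, đ, č and ć.
    preg_match_all('/\w+/u', $text, $matches);

    $counts = [];
    foreach ($matches[0] as $word) {
        // mb_strlen counts characters; strlen would count "ž" as two bytes.
        if (mb_strlen($word, 'UTF-8') <= 3 || in_array($word, $stopWords, true)) {
            continue;
        }
        $counts[$word] = ($counts[$word] ?? 0) + 1;
    }

    arsort($counts);                              // most frequent first
    return array_slice($counts, 0, $limit, true); // keep the word keys
}

$result = topWords('Znaj znaj srcu žurka Žurka žurka', ['the', 'and']);
print_r($result); // žurka => 3, znaj => 2, srcu => 1
```

Passing true as the fourth argument of array_slice() keeps the keys intact even when a counted "word" is purely numeric.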
