I have a problem with Latin characters. Here is the code:

$stopWords = array('i','a','about','an','and','are','as','at','be','by','com','de','en','for','from','how','in','is','it','la','of','on','or','that','the','this','to','was','what','when','where','who','will','with','und','the','www', 'on', 'ona', 'ja');

$string = preg_replace('/\s\s+/i', '', $string); // replace whitespace
$string = trim($string); // trim the string

$string = preg_replace('/[^a-zA-Z0-9žšđčćŽŠĐČĆ -]/', '', $string); // only take alphanumerical characters, but keep the spaces and dashes too…

$string = mb_strtolower($string); // make it lowercase

preg_match_all('/\b.*?\b/i', $string, $matchWords);

$matchWords = $matchWords[0];

foreach ( $matchWords as $key=>$item ) {
    if ( $item == '' || in_array(strtolower($item), $stopWords) || strlen($item) <= 3 ) {
        unset($matchWords[$key]);
    }
}

$wordCountArr = array();
if ( is_array($matchWords) ) {
    foreach ( $matchWords as $key => $val ) {
        $val = strtolower($val);
        if ( isset($wordCountArr[$val]) ) {
            $wordCountArr[$val]++;
        } else {
            $wordCountArr[$val] = 1;
        }
    }
}
arsort($wordCountArr);
$wordCountArr = array_slice($wordCountArr, 0, 10);
return $wordCountArr;

When I return $matchWords[0] from this part of the code:

preg_match_all('/\b.*?\b/i', $string, $matchWords);

I get this string (the array imploded with spaces):

ti si mi znaj na srcu kvar znaj znaj znaj srcu ž urka

There is a space in ž urka.

1 Answer

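From the docs: A word boundary is a position in the subject string where the current character and the previous character do not both match \w or \W (i.e. one matches \w and the other matches \W), or the start or end of the string if the first or last character matches \w, respectively.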

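Without the u modifier, ž (including the space before it) matches \W, but u matches \w, so a word boundary falls between them and you get ž urka.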

These characters at the end of the string will not match the pattern:

 žšđčć ŽŠĐČĆ :)

...they are all \W characters and need to be followed by a \w character to match the pattern (the second \b).
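
The effect can be reproduced in isolation. This is a minimal sketch, assuming the input is UTF-8, PHP runs in the default "C" locale, and PCRE was built with Unicode support. The sample string is taken from the question's output; the variable names are only illustrative.

```php
<?php
// Assumes UTF-8 input, the default "C" locale, and PCRE with Unicode support.
$string = 'srcu žurka';

// Byte mode: "ž" is two bytes and neither counts as \w, so the word is cut.
preg_match_all('/\w+/', $string, $byteMode);

// UTF-8 mode (u modifier): "ž" is a single letter, so the word stays whole.
preg_match_all('/\w+/u', $string, $unicodeMode);

print_r($byteMode[0]);    // srcu, urka
print_r($unicodeMode[0]); // srcu, žurka
```

Since \b is defined in terms of \w and \W, the same switch decides where the word boundaries fall.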

I guess you are looking for the u modifier. Try

preg_match_all('/\b.*?\b/iu', $string, $matchWords);
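
Going one step further than the one-character fix, the whole counting routine could be made multibyte-aware. The following is only a sketch under assumptions: the function name `topKeywords` and its parameters are hypothetical, the input is assumed to be UTF-8, and it matches words with \p{L} and \p{N} instead of \b, which is a swapped-in technique rather than what the question or this answer uses.

```php
<?php
// Hypothetical helper; the name and parameters are illustrative only.
function topKeywords(string $text, array $stopWords, int $limit = 10): array
{
    // Lowercase with an explicit encoding so "Ž" becomes "ž".
    $text = mb_strtolower($text, 'UTF-8');

    // \p{L} is any letter, \p{N} any digit; dashes are allowed inside a word.
    // The u modifier makes PCRE read the subject as UTF-8 code points.
    preg_match_all('/[\p{L}\p{N}]+(?:-[\p{L}\p{N}]+)*/u', $text, $matches);

    $words = array_filter($matches[0], function ($word) use ($stopWords) {
        // mb_strlen counts characters; strlen would count "ž" as two bytes.
        return mb_strlen($word, 'UTF-8') > 3
            && !in_array($word, $stopWords, true);
    });

    // Count occurrences, sort by frequency, keep the most frequent words.
    $counts = array_count_values($words);
    arsort($counts);
    return array_slice($counts, 0, $limit, true);
}

$counts = topKeywords('Žurka znaj znaj srcu žurka kvar znaj', array('the', 'and'));
print_r($counts);
```

Matching the words directly avoids the empty and whitespace-only matches that `/\b.*?\b/` produces, so the filter only has to handle stop words and short words.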
answered 2012-08-26T23:43:51.220