Tonight I realized that one of the stripping functions I'm using seems to skip words at random.

<?php
function wordstrip($document){
    //I truncated the list here for brevity
    $wordlist = array(
        "it39s",
        "039",
        "the",
        "while",
        "message");

    //convert all uppercase to lower so matches work correctly
    $document = strtolower($document);
    foreach($wordlist as $word)
        $document = preg_replace("/\s". $word ."\s/", " ", $document);
    //echo $word;
    //echo $document;
    $nopunc = preg_replace('/[^a-z0-9]+/i', ' ', $document);
    $trimmed = trim($nopunc);
    return $trimmed;
}

?>

It skips the word "the" and I don't know why. The list is about 200 words long, and I know the function works because it strips most of the other words.

I fed it "The Last Letter From a Dying Veteran to George W. Bush and Dick Cheney" and got back "the last letter from a dying veteran to george w bush and dick cheney".

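I think it's because of the "/\s", since "the" is at the start of the string. I tried "/\s?" but that didn't work. I guess I just need to make the spaces optional?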

Thanks

1 Answer

You could use \b to represent a word boundary and not fiddle with the spaces or periods or whatever else might surround a word:

$document = strtolower($document);

foreach($wordlist as $word)
    $document = preg_replace("/\b". $word ."\b/", " ", $document);

$nopunc = preg_replace('/[^a-z0-9]+/i', ' ', $document);
$trimmed = trim($nopunc);
return $trimmed;
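
To see the difference side by side, here is a minimal self-contained sketch. The function names `strip_words_ws` and `strip_words_boundary` and the sample sentence are made up for illustration, and the word list is the truncated one from the question. The `preg_quote` call and the single alternation pattern are additions that the snippet above does not use; they are one possible way to escape the words and avoid running one `preg_replace` per word.

```php
<?php
// Hypothetical helper: the question's approach, which needs whitespace on BOTH sides.
function strip_words_ws($document, $wordlist) {
    foreach ($wordlist as $word) {
        $document = preg_replace('/\s' . preg_quote($word, '/') . '\s/', ' ', $document);
    }
    return $document;
}

// Hypothetical helper: \b also matches at the start and end of the string.
function strip_words_boundary($document, $wordlist) {
    // Escape each word, then join them into one alternation pattern.
    $quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $wordlist);
    $pattern = '/\b(?:' . implode('|', $quoted) . ')\b/';
    $document = preg_replace($pattern, ' ', strtolower($document));
    // Collapse punctuation and runs of spaces, as in the original function.
    $nopunc = preg_replace('/[^a-z0-9]+/', ' ', $document);
    return trim($nopunc);
}

$wordlist = array("it39s", "039", "the", "while", "message");

// The leading "the" has no whitespace before it, and the second of two
// adjacent "the"s loses its leading space to the first match.
echo strip_words_ws("the last message while the the dog barks", $wordlist), "\n";
// prints: the last the dog barks

echo strip_words_boundary("The last message while the the dog barks", $wordlist), "\n";
// prints: last dog barks
```

The whitespace version misses a word in two cases: at the very start or end of the string, and when the same word appears twice in a row, because the first match consumes the space that the second match would need.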
answered 2013-03-21T00:35:58.343