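I'm trying to remove noise words from a string, and I have what I think is a good algorithm, but I've hit a snag. Before running preg_replace, I strip all punctuation except the apostrophe ('). Then I run the text through this preg_replace: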
$content = preg_replace('/\b('.implode('|', self::$noiseWords).')\b/','',$content);
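It works well, except for words that contain the ' character. preg_replace seems to treat the apostrophe as a word boundary, so the noise word in front of it (the "i" in "i've", for example) still gets matched. That's a problem for me.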
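Here is a minimal reproduction of the behavior. The two-word noise list and the input string are made up for illustration and are not from my real data:

```php
<?php
// Hypothetical two-word noise list, for illustration only.
$noiseWords = array("i", "it");
$pattern = '/\b(' . implode('|', $noiseWords) . ')\b/';

// \b matches between a word character ([A-Za-z0-9_]) and a non-word
// character. The apostrophe is a non-word character, so the "i" in
// "i've" is treated as a standalone word and removed.
$result = preg_replace($pattern, '', "i've seen it");
var_dump($result); // string(9) "'ve seen "
```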
Is there a way around this? Or maybe a different approach altogether?
Thanks!
Here's the sample I'm working with:
$content = strtolower(strip_tags($content));
$content = preg_replace("/(?!['])\p{P}/u", "", $content);// remove punctuation
echo $content;// i've added striptags for editing as well should still workyep it doesnbsp
$content = preg_replace("/\b(?<!')(".implode('|', self::$noiseWords).")(?!')\b/",'',$content);
$contentArray = explode(" ", $content);
print_r($contentArray);
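On line 3 of the sample, the comment shows what $content looks like before the noise-word preg_replace runs.
You can probably guess what my noiseWords array looks like, but here's a small piece of it: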
$noiseWords = array("a", "able","about","above","abroad","according","accordingly","across",
"actually","adj","after","afterwards","again",......)
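One direction that might work, sketched under assumptions rather than tested against the full list: drop \b and use lookarounds that count the apostrophe as part of a word. The short $noiseWords array and the input string below are hypothetical stand-ins, and preg_quote is added only as a precaution in case the real list ever contains regex metacharacters.

```php
<?php
// Hypothetical short noise list; the real array is much longer.
$noiseWords = array("a", "i", "it", "about", "after");

// Escape each word so regex metacharacters in the list cannot break the pattern.
$quoted = array_map(function ($w) { return preg_quote($w, '/'); }, $noiseWords);

// Lookarounds stand in for \b: a match may not be directly preceded or
// followed by a word character or an apostrophe.
$pattern = "/(?<![\\w'])(?:" . implode('|', $quoted) . ")(?![\\w'])/";

$content = "i've added it after a while";
$content = preg_replace($pattern, '', $content);

// Split on runs of whitespace and drop the empty strings left by removals.
$contentArray = preg_split('/\s+/', $content, -1, PREG_SPLIT_NO_EMPTY);
print_r($contentArray);
```

With this input, "i've" stays intact while "it", "after", and "a" are removed. Splitting with preg_split and PREG_SPLIT_NO_EMPTY instead of explode(" ", ...) avoids the empty array entries that the removed words would otherwise leave behind.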