php - PHP profanity filter

Question

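I'm working on a WordPress plugin that replaces the bad words in comments with random new words from a list.

I now have 2 arrays: one containing the bad words and another containing the good words.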

$bad = array("bad", "words", "here");
$good = array("good", "words", "here");

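Since I'm a beginner, I got stuck at some point.

To replace the bad words, I've been using $newstring = str_replace($bad, $good, $string);.

My first problem is that I want to turn off case sensitivity, so that I don't have to list every variant such as "bad", "Bad", "BAD", "bAd", "BAd", etc. But I need the new word to keep the format of the original word: for example, if I write "Bad" it should be replaced by "Words", but if I type "bad" it should be replaced by "words", and so on.

My first trick was to use str_ireplace, but it forgets whether the original word had a capital letter.

The second problem is that I don't know how to deal with users who type things like "ba d", "word s", etc. I need an idea.

To make it pick a random word, I think I can use $new = $good[rand(0, count($good)-1)]; and then $newstring = str_replace($bad, $new, $string);. If you have a better idea, I'm all ears.

General look of my script: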

function noswear($string)
{
    if ($string)
    {
        $bad = array("bad", "words");
        $good = array("good", "words");
        $newstring = str_replace($bad, $good, $string);
        return $newstring;
    }
}

echo noswear("I see bad words coming!");

Thanks in advance for your help!

score 11 · Accepted Answer

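Preamble

There are (as has been pointed out numerous times in the comments) gaping holes for you, and/or your code, to fall into by implementing such a feature. To name but a few:

People will add characters to fool the filter
People will become creative (e.g. innuendo)
People will use passive aggression and sarcasm
People will use sentences/phrases, not just words

You'd do better to implement a moderation/flagging system where people can flag offensive comments, which can then be edited/removed by mods, users, etc.

On that understanding, let us proceed...

Solution

Given that you:

Have a forbidden word list $bad_words
Have a replacement word list $good_words
Want to replace bad words regardless of case
Want to replace bad words with random good words
Have a correctly escaped bad word list: see http://php.net/preg_quote

You can very easily use PHP's preg_replace_callback function: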

$input_string = 'This Could be interesting but should it be? Perhaps this \'would\' work; or couldn\'t it?';

$bad_words  = array('could', 'would', 'should');
$good_words = array('might', 'will');

function replace_words($matches){
    global $good_words;
    return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
}

echo preg_replace_callback('/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i', 'replace_words', $input_string);

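Okay, so what this does is build one regex pattern out of all of the bad words and hand it to preg_replace_callback. Matches will then be in the format:

/(START OR WORD_BOUNDARY OR WHITE_SPACE)(BAD_WORD)(WORD_BOUNDARY OR WHITE_SPACE OR END)/i

The i modifier makes it case insensitive, so both bad and Bad would match.

The function replace_words then takes the matched word and its boundaries (either empty or a white space character) and replaces them with the boundaries and a random good word.

global $good_words; <-- Makes the $good_words variable accessible from within the function
$matches[1] <-- The word boundary before the matched word
$matches[3] <-- The word boundary after  the matched word
$good_words[rand(0, count($good_words)-1)] <-- Selects a random good word from $good_words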
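
To make the three capture groups concrete, here is a small sketch that runs the same pattern through preg_match and prints what ends up in $matches. The sample sentence is a fragment of the input string above; preg_match is used only for illustration, since the callback receives an array of the same shape.

```php
<?php
$bad_words = array('could', 'would', 'should');
$pattern   = '/(^|\b|\s)(' . implode('|', $bad_words) . ')(\b|\s|$)/i';

// Inspect the capture groups of the first match in a sample sentence.
preg_match($pattern, 'This Could be interesting', $matches);

var_dump($matches[1]); // string(1) " "      the space before the word
var_dump($matches[2]); // string(5) "Could"  the bad word, in its original case
var_dump($matches[3]); // string(0) ""       the trailing \b is zero-width
```

Note that the leading space is part of the match, which is why the callback has to put $matches[1] back in front of the replacement.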

Anonymous function

You could rewrite the above as a single call by passing an anonymous function to preg_replace_callback:

echo preg_replace_callback(
        '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
        function ($matches) use ($good_words){
            return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
        },
        $input_string
    );

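Function wrapper

If you're going to use it multiple times you may also write it as a self-contained function, although in this case you'll most likely want to pass the good/bad words in when calling it (or hard-code them in there permanently), but that depends on how you obtain them...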

function clean_string($input_string, $bad_words, $good_words){
    return preg_replace_callback(
        '/(^|\b|\s)('.implode('|', $bad_words).')(\b|\s|$)/i',
        function ($matches) use ($good_words){
            return $matches[1].$good_words[rand(0, count($good_words)-1)].$matches[3];
        },
        $input_string
    );
}

echo clean_string($input_string, $bad_words, $good_words);

Output

Running the above function consecutively with the input and word lists shown in the first example:

This will be interesting but might it be? Perhaps this 'will' work; or couldn't it?
This might be interesting but might it be? Perhaps this 'might' work; or couldn't it?
This might be interesting but will it be? Perhaps this 'will' work; or couldn't it?

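Of course the replacement words are chosen randomly, so if I refreshed the page I'd get something else... But this shows what does/doesn't get replaced.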
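
The question also asked for the replacement to keep the capitalization of the original word, which the callback above does not do. A minimal sketch of one way to add that, assuming only three cases matter (all caps, first letter capitalized, everything else treated as lower case). The names match_case and clean_string_keep_case are made up for this example, and array_rand is swapped in for rand purely as a shorter way to pick a random element.

```php
<?php
// Hypothetical helper: copy the capitalization pattern of $original onto $replacement.
function match_case($original, $replacement) {
    if (strlen($original) > 1 && ctype_upper($original)) {
        return strtoupper($replacement);          // BAD -> GOOD
    }
    if (ctype_upper($original[0])) {
        return ucfirst(strtolower($replacement)); // Bad -> Good
    }
    return strtolower($replacement);              // bad, bAd, ... -> good
}

// Hypothetical variant of clean_string() that keeps the case of the matched word.
function clean_string_keep_case($input_string, $bad_words, $good_words) {
    // Escape each bad word, including the "/" pattern delimiter.
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, $bad_words);

    return preg_replace_callback(
        '/(^|\b|\s)(' . implode('|', $escaped) . ')(\b|\s|$)/i',
        function ($matches) use ($good_words) {
            $good = $good_words[array_rand($good_words)];
            return $matches[1] . match_case($matches[2], $good) . $matches[3];
        },
        $input_string
    );
}

echo clean_string_keep_case('Bad, bad, BAD!', array('bad'), array('good'));
// prints: Good, good, GOOD!
```

Mixed-case input such as "bAd" falls through to the lower-case branch here; mapping the case letter by letter is possible but gets awkward when the two words differ in length.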

Notes

Escaping `$bad_words`

foreach($bad_words as $key=>$word){
    $bad_words[$key] = preg_quote($word);
}
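
One detail worth checking here: preg_quote does not escape the pattern delimiter unless it is passed as the second argument. Since the patterns in this answer use / as the delimiter, a bad word containing a slash would break the regex. A short sketch with two made-up words:

```php
<?php
$bad_words = array('$h1t', 'a/b');

// Pass the pattern delimiter as the second argument so that "/" is escaped too.
$escaped = array_map(function ($word) {
    return preg_quote($word, '/');
}, $bad_words);

echo implode("\n", $escaped), "\n";
// prints:
// \$h1t
// a\/b
```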

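Word boundaries `\b`

In this code I'm using \b, \s, and ^ or $ as word boundaries, and there is a good reason for this. While white space, start of string, and end of string are all considered word boundaries, \b will not match in all cases, for example:

\b\$h1t\b <---Will not match

This is because \b only matches at a position between a word character (i.e. [a-zA-Z0-9_]) and a non-word character, and characters like $ don't count as word characters.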
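
The difference is easy to check with preg_match. The sample sentence below is made up for the demonstration:

```php
<?php
$text = 'what a $h1t day';

// \b needs a word character on one side; neither the space nor "$" is one.
var_dump(preg_match('/\b\$h1t\b/i', $text));                 // int(0)

// Allowing \s (or ^) as the leading boundary lets the word match.
var_dump(preg_match('/(^|\b|\s)(\$h1t)(\b|\s|$)/i', $text)); // int(1)
```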

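Misc

Depending on the size of your word list there are a couple of potential hiccups. From a system design perspective it's generally bad form to have huge regexes, for a few reasons:

It can be difficult to maintain
It's difficult to read/understand what it does
It's difficult to find errors
It can be memory intensive if the list is too large

Given that the regex pattern is built by PHP, the first reason is negated. The second should be negated as well; if your word list is large, with a dozen permutations of each bad word, then I suggest you stop and rethink your approach (read: use a flagging/moderation system).

To clarify, I don't see a problem with having a small word list to filter out specific expletives, as it serves a purpose: to stop users from having an outburst at each other. The problem comes when you try to filter out too much, including permutations. Stick to filtering common swear words, and if that doesn't work then, for the last time, implement a flagging/moderation system.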
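
If the list does grow, one way to avoid a huge alternation is to match every word with a fixed, small pattern and look it up in an array instead. This is a different technique from the one used above, sketched under the assumption that bad words consist of letters and digits only (so entries like $h1t would need extra handling). The function name clean_string_lookup is made up for this example.

```php
<?php
// Hypothetical alternative: tokenize and look words up instead of building a big alternation.
function clean_string_lookup($input_string, array $bad_words, array $good_words) {
    // Lower-cased bad words become array keys, so each check is a hash lookup.
    $lookup = array_flip(array_map('strtolower', $bad_words));

    return preg_replace_callback(
        '/[a-z0-9]+/i',
        function ($matches) use ($lookup, $good_words) {
            if (isset($lookup[strtolower($matches[0])])) {
                return $good_words[array_rand($good_words)];
            }
            return $matches[0]; // not a bad word: leave it untouched
        },
        $input_string
    );
}

echo clean_string_lookup("This Could work; or couldn't it?", array('could', 'would'), array('might'));
// prints: This might work; or couldn't it?
```

The pattern stays the same size no matter how many words are in the list, and "couldn't" is left alone because its tokens ("couldn" and "t") are not in the lookup.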

score 5 · Accepted Answer

I came up with this method and it works fine. It returns true if the input is one of the bad words.

Example:

function badWordsFilter($inputWord) {
  $badWords = array("bad", "words", "here");
  for ($i = 0; $i < count($badWords); $i++) {
    if ($badWords[$i] == strtolower($inputWord)) {
      return true;
    }
  }
  return false;
}

Usage:

if (badWordsFilter("bad")) {
    echo "Bad word was found";
} else {
    echo "No bad words detected";
}

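It will echo "Bad word was found", since the word "bad" is blacklisted.

Online example 1

EDIT 1:

As suggested by rid, it's also possible to do a simple in_array check: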

function badWordsFilter($inputWord) {
  $badWords = Array("bad","words","here");
     if(in_array(strtolower($inputWord), $badWords) ) {
        return true;
     }
  return false;
}

Online example 2
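
Both versions above only return true when the whole input is exactly one blacklisted word. For a full comment, a rough sketch would split the text into words first. The function name containsBadWord is made up here, and the word list is the same placeholder list as above.

```php
<?php
// Hypothetical variant of badWordsFilter() that accepts a whole sentence.
function containsBadWord($text) {
    $badWords = array("bad", "words", "here");
    // Split on anything that is not a letter or digit, dropping empty pieces.
    $words = preg_split('/[^a-z0-9]+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    return count(array_intersect($words, $badWords)) > 0;
}

var_dump(containsBadWord("I see BAD things coming!")); // bool(true)
var_dump(containsBadWord("Nothing to see"));           // bool(false)
```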

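EDIT 2:

As I promised, I came up with a slightly different idea for replacing bad words with good words, as you mentioned in your question. I hope it will help you a bit, but this is the best I can offer at the moment, as I'm not entirely sure what you want to do.

Example:

1. Let's combine the arrays of bad and good words into one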

$wordsTransform = array(
  'shit' => 'ship'
);

2. Your imaginary user input

$string = "Rolling In The Deep by Adel\n
\n
There's a fire starting in my heart\n
Reaching a fever pitch, and it's bringing me out the dark\n
Finally I can see you crystal clear\n
Go ahead and sell me out and I'll lay your shit bare";

3. Replace the bad words with good words

$string = strtr($string, $wordsTransform);

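4. Get the desired output

Rolling In The Deep by Adel

There's a fire starting in my heart
Reaching a fever pitch, and it's bringing me out the dark
Finally I can see you crystal clear
Go ahead and sell me out and I'll lay your ship bare

Online example 3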
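
Two properties of strtr are worth keeping in mind with this approach: it is case sensitive, and it replaces substrings wherever they occur, not whole words. A quick check with made-up inputs:

```php
<?php
$wordsTransform = array('shit' => 'ship');

echo strtr("Shit happens", $wordsTransform), "\n";  // Shit happens  (capital S: no match)
echo strtr("a mishit ball", $wordsTransform), "\n"; // a miship ball (substring replaced)
```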

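EDIT 3:

To follow Wrikken's correct comment, I had totally forgotten that strtr is case sensitive and that it is better to respect word boundaries. I borrowed the following example from PHP: strtr - Manual and modified it slightly.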

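Same idea as in my second edit, but case insensitive; it checks for word boundaries and puts a backslash in front of every character that is part of the regular expression syntax: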

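1. The method:

//
// Written by Patrick Rauchfuss (slightly modified)
// Named StringUtil here because "String" cannot be used as a class name in PHP 7+.
class StringUtil
{
    // Case-insensitive, word-boundary-aware strtr(); modifies $string in place.
    public static function stritr(&$string, $from, $to = NULL)
    {
        if (is_string($from)) {
            // Escape regex special characters in the search word, not in the result.
            $string = preg_replace('/\b' . preg_quote($from, '/') . '\b/i', $to, $string);
        } else if (is_array($from)) {
            foreach ($from as $key => $val)
                self::stritr($string, $key, $val);
        }
        return $string;
    }
}

2. An array with bad and good words

$wordsTransform = array(
            'shit' => 'ship'
        );

3. Replacement

StringUtil::stritr($string, $wordsTransform);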
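
The stritr method above runs one preg_replace per entry in the map. For comparison, here is a sketch of a single-pass variant that puts all keys into one pattern and looks the replacement up in a callback. The function name replace_mapped_words is made up for this example, and it assumes every key starts and ends with a word character so that \b behaves as expected.

```php
<?php
// Hypothetical single-pass alternative: one regex covering all keys of the map.
function replace_mapped_words($string, array $map) {
    // Lower-case the keys so the lookup is case insensitive.
    $lookup  = array_change_key_case($map, CASE_LOWER);
    $escaped = array_map(function ($word) {
        return preg_quote($word, '/');
    }, array_keys($lookup));

    return preg_replace_callback(
        '/\b(' . implode('|', $escaped) . ')\b/i',
        function ($matches) use ($lookup) {
            return $lookup[strtolower($matches[1])];
        },
        $string
    );
}

echo replace_mapped_words("I'll lay your Shit bare", array('shit' => 'ship'));
// prints: I'll lay your ship bare
```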
