php - How to find similar text in a large string?

Question

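I have a large string str and a needle ndl. I need to find the text in str that is similar to ndl. For example,

Source: "This is a demo text and I love you about this".

Needle: "I you love"

Output: "I love you"

Source: "I have a unique idea. Do you need one?".

Needle: "a unik idia"

Output: "a unique idea"

I found that I can do this with similarity measures such as cosine or Manhattan similarity. However, I think implementing such an algorithm would be difficult. Could you suggest a simple or fast way to do this, perhaps using a PHP library function? TIA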

score 1 · Accepted Answer

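There is no native PHP function that achieves this. But what you can do with PHP is limited only by your imagination. We can't recommend libraries on SO, and you should keep in mind that this kind of question can be flagged as off-topic. So instead of suggesting a library, I will point you in the direction you need to explore.

By design, your question shows that simple string-matching functions such as stripos and co are not what you need, and regular expressions can't achieve this either. For example

unik and unique

and

idia and idea

cannot be matched by these functions. So you need to look at something like the levenshtein function. But because you need a substring rather than the whole string, and to make the work easier on your server, you need to use some imagination. You can, for example, break the haystack and the needle into words and then use levenshtein to find the haystack words closest to the words of your needle.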
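Before the fuller version, the core of that idea fits in a few lines. This is only a minimal sketch, assuming ASCII input; `closest_word` is a hypothetical helper name, not a PHP built-in, and on ties it simply keeps the first word it saw.

```php
<?php
// Hypothetical helper: return the haystack word with the smallest
// Levenshtein distance to a single needle word (first one wins on ties).
function closest_word(string $needleWord, string $haystack): ?string
{
    // Split on runs of non-word characters and drop empty pieces.
    $words = preg_split('/\W+/', $haystack, -1, PREG_SPLIT_NO_EMPTY);
    $best = null;
    $bestDistance = PHP_INT_MAX;
    foreach ($words as $word) {
        $distance = levenshtein($needleWord, $word);
        if ($distance < $bestDistance) {
            $bestDistance = $distance;
            $best = $word;
        }
    }
    return $best; // null when the haystack contains no words
}

echo closest_word('unik', 'I have a unique idea. Do you need?'), "\n"; // unique
echo closest_word('idia', 'I have a unique idea. Do you need?'), "\n"; // idea
```

Calling it once per needle word and collecting the results gives the general shape of the approach; handling ties and non-ASCII text is where the extra work goes.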

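Here is one way to achieve that. Read the comments carefully to understand the idea, and you will be able to implement something better.

For strings that contain only ASCII characters, this is relatively easy to implement. With other encodings you may run into a lot of difficulties. A simple way to handle multibyte strings could be something like this: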

    function to_ascii($text,$encoding="UTF-8") {
      if (is_string($text)) {
        // Includes combinations of characters that present as a single glyph
        $text = preg_replace_callback('/\X/u', __FUNCTION__, $text);
      }
      elseif (is_array($text) && count($text) == 1 && is_string($text[0])) {
        // IGNORE characters that can't be TRANSLITerated to ASCII
        $text = @iconv($encoding, "ASCII//IGNORE//TRANSLIT", $text[0]);
        // The documentation says that iconv() returns false on failure but it returns ''
        if ($text === '' || !is_string($text)) {
          $text = '?';
        }
        elseif (preg_match('/\w/', $text)) {        // If the text contains any letters...
          $text = preg_replace('/\W+/', '', $text); // ...then remove all non-letters
        }
      }
      else {  // $text was not a string
        $text = '';
      }
      return $text;
    }





function find_similar($needle,$str,$keep_needle_order=false){
    if(!is_string($needle)||!is_string($str))
    {
        return false;
    }
    $valid=array();
    //get  encodings  and words from haystack and needle
    setlocale(LC_CTYPE, 'en_GB.UTF8');
    $encoding_s=mb_detect_encoding($str);
    $encoding_n=mb_detect_encoding($needle);

    mb_regex_encoding ($encoding_n);
    $pneed=array_filter(mb_split('\W',$needle));

    mb_regex_encoding ($encoding_s);
    $pstr=array_filter(mb_split('\W',$str));



    foreach($pneed as $k=>$word)//loop through the needle's words
    {
        foreach($pstr as $key=>$w)
        {
            if($encoding_n!==$encoding_s)
            {//if $encodings are not the same make some transliteration
                $tmp_word=($encoding_n!=='ASCII')?to_ascii($word,$encoding_n):$word; 
                $tmp_w=($encoding_s!=='ASCII')?to_ascii($w,$encoding_s):$w;
            }else
            {
                $tmp_word=$word;
                $tmp_w=$w;
            }

            $tmp[$tmp_w]=levenshtein($tmp_w,$tmp_word);//collect levenshtein distances
            $keys[$tmp_w]=array($key,$w);

        }

        $nominees=array_flip(array_keys($tmp,min($tmp)));//get the nominees
        $tmp=10000;
        foreach($nominees as $nominee=>$idx)
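        {//compare how the words sound (metaphone) to break ties between nominees
            $idx=levenshtein(metaphone($nominee),metaphone($tmp_word));
            if($idx<$tmp){
                $tmp=$idx;//remember the best score so far so later nominees have to beat it
                $answer=$nominee;//get the winner

            }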
            unset($nominees[$nominee]);
        }
        if(!$keep_needle_order){
            $valid[$keys[$answer][0]]=$keys[$answer][1];//get the right form of the winner
        }
        else{
            $valid[$k]=$keys[$answer][1];
        }
        $tmp=$nominees=array();//clean a little for the next iteration
    }
    if(!$keep_needle_order)
    {
        ksort($valid);
    }

    $valid=array_values($valid);//get only the values
    /*return the array of the closest value to the 
    needle according to this algorithm of course*/
    return $valid;

}


var_dump(find_similar('i knew you love me','finally  i know you loved me and all my pets'));
var_dump(find_similar('I you love','This is a demo text and I love you about this'));
var_dump(find_similar('a unik idia','I have a unique idea. Do you need?'));
var_dump(find_similar("Goebel, Weiss, Goethe, Goethe und Goetz",'Weiß, Goldmann, Göbel, Weiss, Göthe, Goethe und Götz'));
var_dump(find_similar('Ḽơᶉëᶆ ȋṕšᶙṁ ḍỡḽǭᵳ ʂǐť ӓṁệẗ, ĉṓɲṩḙċťᶒțûɾ ấɖḯƥĭṩčįɳġ ḝłįʈ',
'Ḽơᶉëᶆ ȋṕšᶙṁ ḍỡḽǭᵳ ʂǐť ӓṁệẗ, ĉṓɲṩḙċťᶒțûɾ ấɖḯƥĭṩčįɳġ ḝłįʈ, șếᶑ ᶁⱺ ẽḭŭŝḿꝋď ṫĕᶆᶈṓɍ ỉñḉīḑȋᵭṵńť ṷŧ ḹẩḇőꝛế éȶ đꝍꞎôꝛȇ ᵯáꞡᶇā ąⱡîɋṹẵ.'));

The output is:

array(5) {
  [0]=>
  string(1) "i"
  [1]=>
  string(4) "know"
  [2]=>
  string(3) "you"
  [3]=>
  string(5) "loved"
  [4]=>
  string(2) "me"
}
array(3) {
  [0]=>
  string(1) "I"
  [1]=>
  string(4) "love"
  [2]=>
  string(3) "you"
}
array(3) {
  [0]=>
  string(1) "a"
  [1]=>
  string(6) "unique"
  [2]=>
  string(4) "idea"
}
array(5) {
  [0]=>
  string(6) "Göbel"
  [1]=>
  string(5) "Weiss"
  [2]=>
  string(6) "Goethe"
  [3]=>
  string(3) "und"
  [4]=>
  string(5) "Götz"
}
array(8) {
  [0]=>
  string(13) "Ḽơᶉëᶆ"
  [1]=>
  string(13) "ȋṕšᶙṁ"
  [2]=>
  string(14) "ḍỡḽǭᵳ"
  [3]=>
  string(6) "ʂǐť"
  [4]=>
  string(11) "ӓṁệẗ"
  [5]=>
  string(26) "ĉṓɲṩḙċťᶒțûɾ"
  [6]=>
  string(23) "ấɖḯƥĭṩčįɳġ"
  [7]=>
  string(9) "ḝłįʈ"
}

If you need the output as a string, you can apply join to the function's result before using it.
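For instance, with a literal array standing in for the return value of `find_similar` (the values are copied from the third result in the output above, and the variable name `$matches` is only illustrative):

```php
<?php
// Stand-in for the array returned by find_similar('a unik idia', ...).
$matches = ['a', 'unique', 'idea'];

// join() is an alias of implode(): glue the words back together with spaces.
echo join(' ', $matches), "\n"; // a unique idea
```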

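You can run the working code and check the results online.

But you must keep in mind that this does not work for every kind of string, nor on every PHP version.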
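One reason, as a hedged illustration: `levenshtein` compares bytes rather than characters, so multibyte UTF-8 text produces larger distances than the visible difference suggests. The sketch below writes the umlaut with a `\u{...}` escape, which assumes PHP 7 or later.

```php
<?php
// In UTF-8, "ö" (U+00F6) is two bytes (0xC3 0xB6), while "o" is one byte.
$oUmlaut = "\u{00F6}";

var_dump(strlen($oUmlaut));            // int(2): strlen counts bytes
var_dump(levenshtein($oUmlaut, 'o'));  // int(2): one substitution plus one deletion

// Only one visible letter differs, yet the byte-level distance is 2.
var_dump(levenshtein("G{$oUmlaut}tz", 'Gotz')); // int(2)
```

This is why the code above transliterates to ASCII when the two encodings differ, and why strings that stay in UTF-8 on both sides can still get skewed distances.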

score 0 · Accepted Answer

Here is a very simple way:

$source = "This is a demo text and I love you about this";
$needle = "I you love";
$words = explode(" " , $source);
$needleWords = explode(" ", $needle);
$results = [];

foreach($needleWords as $key => $needleWord) {

    foreach($words as $keyWords => $word) {

        if(strcasecmp($word, $needleWord) == 0) {
            $results[$keyWords] = $needleWord;
        }
    }
}
uksort($results, function($a , $b) {
    return $a - $b;
});
echo(implode(" " , $results));

Output

I love you

score 0 · Accepted Answer

Try this code to find a string within a string

$data = "I have a unique idea. Do you need one?";
$find = "a unique idea";
$start = strpos($data, $find);
if($start !== false){     // strpos returns 0 for a match at the very start, so compare strictly
    print_r(substr($data, $start, strlen($find)));
} else {
    echo "not found";
}
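Note that a plain substring search only finds exact matches, so it cannot handle the misspelled needle from the question. A small sketch showing both cases, using `stripos` for a case-insensitive search (the helper name `find_exact` is hypothetical):

```php
<?php
// Hypothetical helper: return the matched slice of $data, or null if absent.
function find_exact(string $data, string $find): ?string
{
    $start = stripos($data, $find); // false when not found, 0 when at the start
    if ($start === false) {
        return null;
    }
    return substr($data, $start, strlen($find));
}

$data = 'I have a unique idea. Do you need one?';
var_dump(find_exact($data, 'i have'));      // string(6) "I have"
var_dump(find_exact($data, 'a unik idia')); // NULL
```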
