php - String key phrase matching

Question

With levenshtein, "how are you", "hw r u", "how are u", and "hw ar you" can all be treated as the same.

Is there any way I can achieve this

when I have phrases like these?

phrase

hi, my name is john doe. I live in new york. What is your name?

phrase

My name is Bruce. wht's your name

key phrase

What is your name

response

my name is batman.

I'm getting the input from the user. I have a table with a list of possible requests and their responses. For example, the user will ask for its name. Is there a way I can check whether a sentence contains a key phrase like "What is your name" and, if it is found, return the corresponding response?

like

$phrase = 'hi, my name is john doe. I live in new york. What is your name?';
 
//I know this one will work
if (strpos($phrase,"What is your name") !== false) {
    return $response;
}

//but what if the user mistypes it
if (strpos($phrase,"Wht's your name") !== false) {
    return $response;
}

Is there a way to achieve this? levenshtein works perfectly only if the input is not much longer than the string it is compared with.

like

hi,wht's your name

my name is batman.

But if it is this long:

hi, my name is john doe. I live in new york. What is your name?

it's not working well. If there are shorter key phrases, it will pick a shorter phrase that has a smaller distance and return the wrong response.
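A small sketch of the problem just described, assuming the key phrase and the two inputs from this question. It only prints the raw distances: every extra character in the input counts as an edit, so the distance to the long input is at least the difference in length between the two strings.

```php
<?php
// Distance from the key phrase to a short input and to a long input.
$keyPhrase = 'what is your name';
$short     = "hi,wht's your name";
$long      = 'hi, my name is john doe. I live in new york. What is your name?';

// Lowercase first so that case differences are not counted as edits.
$shortDistance = levenshtein($keyPhrase, strtolower($short));
$longDistance  = levenshtein($keyPhrase, strtolower($long));

echo "short input: $shortDistance\n";
echo "long input:  $longDistance\n";
```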

I was thinking another way around this is to check for some key phrase. Any idea how to achieve this?

I was working on something like this, but maybe there is a better and more proper way:

$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';

$keyPhrase = 'What is your name';

1. Get the first character of keyPhrase. That would be 'W'.
2. Iterate through the $samplePhrase characters and compare each one to the first character of keyPhrase:
3. h, i, ' ', i, m, ' ', s, p, etc.
4. If keyPhrase.char = samplePhrase.currentChar:
5. get keyPhrase.length,
6. get the samplePhrase.currentChar index,
7. get the substring of samplePhrase that starts at the currentChar index and is keyPhrase.length characters long.
8. The first substring it will get would be "work at krabby pa".
9. Compare "work at krabby pa" to $keyPhrase ('What is your name') using the Levenshtein distance,
10. and to check it better use similar_text.
11. If they are not equal and the distance is too big, repeat the process.
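The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: find_closest_window() and $maxDistance are hypothetical names, the comparison is made case-insensitive so that 'W' also matches 'w', and the similar_text check from step 10 is left out.

```php
<?php
/**
 * Slide a window the length of the key phrase over the sample, starting
 * at every position whose character matches the key phrase's first
 * character, and return the closest window by Levenshtein distance.
 * find_closest_window() and $maxDistance are illustrative names.
 */
function find_closest_window(string $sample, string $key, int $maxDistance): ?array
{
    $sample = strtolower($sample);
    $key    = strtolower($key);
    $keyLen = strlen($key);
    $best   = null;

    for ($i = 0, $n = strlen($sample); $i < $n; $i++) {
        if ($sample[$i] !== $key[0]) {
            continue; // steps 2-4: only start where the first characters agree
        }
        $window   = substr($sample, $i, $keyLen);   // steps 5-7
        $distance = levenshtein($key, $window);     // step 9
        if ($distance <= $maxDistance && ($best === null || $distance < $best['distance'])) {
            $best = ['window' => $window, 'distance' => $distance, 'offset' => $i];
        }
    }
    return $best; // null when no window is close enough (step 11)
}

$samplePhrase = 'hi, im spongebob, i work at krabby patty. i love patties. Whts your name my friend';
$keyPhrase    = 'What is your name';

$match = find_closest_window($samplePhrase, $keyPhrase, 8);
if ($match !== null) {
    echo "closest window: '{$match['window']}' (distance {$match['distance']})\n";
}
```

One weakness of a fixed-width character window shows up here: when the mistyped phrase is shorter than the key phrase, the window also picks up the characters that follow it, which inflates the distance.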

score 1 · Accepted Answer

My suggestion is to generate a list of n-grams from the input phrase and calculate the edit distance between each n-gram and the key phrase.

Example:

key phrase: "What is your name"
phrase 1: "hi, my name is john doe. I live in new york. What is your name?"
phrase 2: "My name is Bruce. wht's your name"

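A likely matching n-gram is between 3 and 4 words long, so we create all 3-grams and 4-grams for each phrase. We should also normalize the strings by removing punctuation and lowercasing everything.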

phrase 1 3-grams:
"hi my name", "my name is", "name is john", "is john doe", "john doe i", "doe i live"... "what is your", "is your name"
phrase 1 4-grams:
"hi my name is", "my name is john", "name is john doe", "is john doe i"... "what is your name"

phrase 2 3-grams:
"my name is", "name is bruce", "is bruce wht's", "bruce wht's your", "wht's your name"
phrase 2 4-grams:
"my name is bruce", "name is bruce wht's", "is bruce wht's your", "bruce wht's your name"

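Next, you can compute the Levenshtein distance for each n-gram; this should solve the use case you presented above. If you need to normalize each word further, you can use a phonetic encoder such as Double Metaphone or NYSIIS. However, I ran a test with all the "common" phonetic encoders and, in your case, they did not show a significant improvement; phonetic encoders are better suited to names.

My experience with PHP is limited, but here is a code example: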

<?php
function extract_ngrams($phrase, $min_words, $max_words) {
    echo "Calculating N-Grams for phrase: $phrase\n";
    $ngrams = array();
    $words  = str_word_count(strtolower($phrase), 1);
    $word_count = count($words);

    for ($i = 0; $i <= $word_count - $min_words; $i++) {
        for ($j = $min_words; $j <= $max_words && ($j + $i) <= $word_count; $j++) {
            $ngrams[] = implode(' ',array_slice($words, $i, $j));
        }
    }
    return array_unique($ngrams);
}

function contains_key_phrase($ngrams, $key) {
    foreach ($ngrams as $ngram) {
        if (levenshtein($key, $ngram) < 5) {
            echo "found match: $ngram\n";
            return true;
        }
    }
    return false;
}

$key_phrase = "what is your name";
$phrases = array(
        "hi, my name is john doe. I live in new york. What is your name?",
        "My name is Bruce. wht's your name"
        );
$min_words = 3;
$max_words = 4;

foreach ($phrases as $phrase) {
    $ngrams = extract_ngrams($phrase, $min_words, $max_words);
    if (contains_key_phrase($ngrams,$key_phrase)) {
        echo "Phrase [$phrase] contains the key phrase [$key_phrase]\n";
    }
}
?>

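The output looks like this:

Calculating N-Grams for phrase: hi, my name is john doe. I live in new york. What is your name?
found match: what is your name
Phrase [hi, my name is john doe. I live in new york. What is your name?] contains the key phrase [what is your name]
Calculating N-Grams for phrase: My name is Bruce. wht's your name
found match: wht's your name
Phrase [My name is Bruce. wht's your name] contains the key phrase [what is your name]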
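To connect this with the table of requests and responses mentioned in the question, here is a minimal sketch that returns the response whose key phrase has the closest n-gram. The $responses array, the 'how are you' entry, and the word_ngrams() and best_response() helpers are assumptions made for illustration; the distance limit of 4 mirrors the `< 5` check above.

```php
<?php
// Hypothetical request/response table; in the question this lives in a database.
$responses = array(
    'what is your name' => 'my name is batman.',
    'how are you'       => 'i am fine.',
);

// Same word n-gram idea as above, without the debug output.
function word_ngrams($phrase, $min_words, $max_words) {
    $words  = str_word_count(strtolower($phrase), 1);
    $count  = count($words);
    $ngrams = array();
    for ($i = 0; $i < $count; $i++) {
        for ($n = $min_words; $n <= $max_words && $i + $n <= $count; $n++) {
            $ngrams[] = implode(' ', array_slice($words, $i, $n));
        }
    }
    return array_unique($ngrams);
}

// Return the response whose key phrase has the closest n-gram, or null.
function best_response($phrase, array $responses, $max_distance = 4) {
    $best          = null;
    $best_distance = $max_distance + 1;
    foreach ($responses as $key => $response) {
        // Compare against n-grams about as long as the key phrase itself.
        $n = count(explode(' ', $key));
        foreach (word_ngrams($phrase, max(1, $n - 1), $n) as $ngram) {
            $d = levenshtein($key, $ngram);
            if ($d < $best_distance) {
                $best_distance = $d;
                $best          = $response;
            }
        }
    }
    return $best;
}

echo best_response("My name is Bruce. wht's your name", $responses), "\n";
```

Picking the smallest distance over all key phrases, instead of stopping at the first n-gram under the limit, is one way to reduce wrong responses when several key phrases are similar.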

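EDIT: I noticed some suggestions to add phonetic encoding to each word in the generated n-grams. I'm not sure phonetic encoding is the best answer to this problem, since these encoders are mostly tuned to stemming names (American, German, or French, depending on the algorithm) and are not very good at stemming plain words.

I actually wrote a test to validate this in Java (since the encoders are more readily available there). Here is the output: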

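============================
Created new phonetic matcher
    Engine: Caverphone2
    Key Phrase: what is your name
    Encoded Key Phrase: WT11111111 AS11111111 YA11111111 NM11111111
Found match: [What is your name?] Encoded: WT11111111 AS11111111 YA11111111 NM11111111
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
============================
Created new phonetic matcher
    Engine: DoubleMetaphone
    Key Phrase: what is your name
    Encoded Key Phrase: AT AS AR NM
Found match: [What is your] Encoded: AT AS AR
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: ATS AR NM
Phrase: [My name is Bruce. wht's your name] MATCH: true
============================
Created new phonetic matcher
    Engine: Nysiis
    Key Phrase: what is your name
    Encoded Key Phrase: WAT I YAR NAN
Found match: [What is your name?] Encoded: WAT I YAR NAN
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: WT YAR NAN
Phrase: [My name is Bruce. wht's your name] MATCH: true
============================
Created new phonetic matcher
    Engine: Soundex
    Key Phrase: what is your name
    Encoded Key Phrase: W300 I200 Y600 N500
Found match: [What is your name?] Encoded: W300 I200 Y600 N500
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Phrase: [My name is Bruce. wht's your name] MATCH: false
============================
Created new phonetic matcher
    Engine: RefinedSoundex
    Key Phrase: what is your name
    Encoded Key Phrase: W06 I03 Y09 N8080
Found match: [What is your name?] Encoded: W06 I03 Y09 N8080
Phrase: [hi, my name is john doe. I live in new york. What is your name?] MATCH: true
Found match: [wht's your name] Encoded: W063 Y09 N8080
Phrase: [My name is Bruce. wht's your name] MATCH: true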

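I used a Levenshtein distance of 4 when running these tests, but I am pretty sure you will find multiple edge cases where a phonetic encoder fails to match correctly. Looking at the example, you can see that, because of the stemming done by the encoders, you are actually more likely to get false positives when using them this way. Keep in mind that these algorithms were originally designed to find people with the same name in a population census, not to find English words that really "sound" the same.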

score 1 · Accepted Answer

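What you are trying to achieve is a quite complex natural language processing task, and it usually requires parsing, among other things.

What I am going to suggest is to create a sentence tokenizer that splits the phrase into sentences. Then tokenize each sentence by splitting on whitespace and punctuation, and probably also rewrite some abbreviations to a more normal form.

Then you can create custom logic that traverses each sentence's list of tokens looking for a specific meaning. For example, ['...','what','...','...','your','name','...','...','?'] can also mean "what is your name". The sentence could be "So, what is your name really?" or "What could your name be?"

I am adding code as an example. I am not saying you should use something that simple. The code below uses NlpTools, a natural language processing library in PHP (I am involved in the library, so feel free to assume I am biased).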

 <?php

 include('vendor/autoload.php');

 use \NlpTools\Tokenizers\ClassifierBasedTokenizer;
 use \NlpTools\Classifiers\Classifier;
 use \NlpTools\Tokenizers\WhitespaceTokenizer;
 use \NlpTools\Tokenizers\WhitespaceAndPunctuationTokenizer;
 use \NlpTools\Documents\Document;

 class EndOfSentence implements Classifier
 {
     public function classify(array $classes, Document $d)
     {
         list($token, $before, $after) = $d->getDocumentData();

         $lastchar = substr($token, -1);
         $dotcnt = count(explode('.',$token))-1;

         if (count($after)==0)
             return 'EOW';

         // for some abbreviations
         if ($dotcnt>1)
             return 'O';

         if (in_array($lastchar, array(".","?","!")))
             return 'EOW';

         // every other token stays inside the current sentence
         return 'O';
     }
 }

 function normalize($s) {
     // get this somewhere static
     $hash_table = array(
         'whats'=>'what is',
         'whts'=>'what is',
         'what\'s'=>'what is',
         '\'s'=>'is',
         'n\'t'=>'not',
         'ur'=>'your'
         // .... more ....
     );

     $s = mb_strtolower($s,'utf-8');
     if (isset($hash_table[$s]))
         return $hash_table[$s];
     return $s;
 }

 $whitespace_tok = new WhitespaceTokenizer();
 $punct_tok = new WhitespaceAndPunctuationTokenizer();
 $sentence_tok = new ClassifierBasedTokenizer(
     new EndOfSentence(),
     $whitespace_tok
 );

 $text = 'hi, my name is john doe. I live in new york. What\'s your name? whts ur name';

 foreach ($sentence_tok->tokenize($text) as $sentence) {
     $words = $whitespace_tok->tokenize($sentence);
     $words = array_map(
         'normalize',
         $words
     );
     $words = call_user_func_array(
         'array_merge',
         array_map(
             array($punct_tok,'tokenize'),
             $words
         )
     );

     // decide what this sequence of tokens is
     print_r($words);
 }
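The "decide what this sequence of tokens is" step is left open above. One very simple way to sketch it, in plain PHP without NlpTools, is an in-order test: do the pattern tokens appear in the sentence in the same order, with anything in between? matches_in_order() is a hypothetical helper, and the token lists below are assumed examples, not captured tokenizer output.

```php
<?php
/**
 * Check whether all $pattern tokens occur in $tokens in the same order,
 * allowing any number of other tokens in between.
 * matches_in_order() is an illustrative helper, not part of NlpTools.
 */
function matches_in_order(array $tokens, array $pattern): bool
{
    $next = 0; // index of the pattern token we are still looking for
    foreach ($tokens as $token) {
        if ($next < count($pattern) && $token === $pattern[$next]) {
            $next++;
        }
    }
    return $next === count($pattern);
}

// Token lists as a tokenizer might produce them (assumed, not captured output).
$a = ['so', ',', 'what', 'is', 'your', 'name', 'really', '?'];
$b = ['what', 'could', 'your', 'name', 'be', '?'];
$c = ['i', 'live', 'in', 'new', 'york', '.'];

var_dump(matches_in_order($a, ['what', 'your', 'name'])); // bool(true)
var_dump(matches_in_order($b, ['what', 'your', 'name'])); // bool(true)
var_dump(matches_in_order($c, ['what', 'your', 'name'])); // bool(false)
```

A rule this loose will also accept sentences such as "what did your friend name the dog", so real logic would likely need more context than token order alone.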

score 0 · Accepted Answer

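You might consider using the soundex function to convert the input string into a phonetically equivalent representation, and then proceed with your search.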
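A minimal sketch of that idea, assuming each word is encoded separately with PHP's built-in soundex() and the encoded key phrase is then searched for with strpos(). soundex_words() is a hypothetical helper name.

```php
<?php
// Encode every word of a string with PHP's built-in soundex().
function soundex_words($text) {
    $words = str_word_count(strtolower($text), 1);
    return implode(' ', array_map('soundex', $words));
}

$key     = soundex_words('What is your name');
$phrase1 = soundex_words('hi, my name is john doe. I live in new york. What is your name?');
$phrase2 = soundex_words("My name is Bruce. wht's your name");

echo "$key\n"; // W300 I200 Y600 N500
var_dump(strpos($phrase1, $key) !== false); // bool(true)
var_dump(strpos($phrase2, $key) !== false); // bool(false)
```

The second phrase is missed because "wht's" is a single word with its own code, so an exact search on the encoded strings is not enough for merged or contracted words; it would still need some tolerance, such as an edit distance on the encoded strings.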

score 0 · Accepted Answer

First, fix all the short codes, for example:

$txt = $_POST['txt'];
$txt = str_ireplace("hw r u", "how are you", $txt);
// A space before and after each short code is required; otherwise every
// occurrence of "hw" is replaced, even inside a word.
$txt = str_ireplace(" hw ", " how ", $txt);
$txt = str_ireplace(" r ", " are ", $txt);
$txt = str_ireplace(" u ", " you ", $txt);
$txt = str_ireplace(" wht's ", " What is ", $txt);

Similarly, add as many phrases as you want. Now just check the text for each possible question and get its position:

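// strpos() returns 0 for a match at the very start of the string,
// so the result has to be compared strictly against false.
if (strpos($txt, "What is your name") !== false) {
    return $response;
}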
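Space-delimited replacements miss a short code at the very start or end of the text, or next to punctuation. A possible variant, sketched here with preg_replace() and `\b` word boundaries, avoids that; the $shortcodes table and expand_shortcodes() are hypothetical names, not part of the approach above.

```php
<?php
// Hypothetical shorthand table; extend as needed.
$shortcodes = array(
    'hw'    => 'how',
    'r'     => 'are',
    'u'     => 'you',
    "wht's" => 'what is',
);

function expand_shortcodes($txt, array $shortcodes) {
    foreach ($shortcodes as $short => $full) {
        // \b anchors the match at word boundaries; "i" makes it case-insensitive.
        $pattern = '/\b' . preg_quote($short, '/') . '\b/i';
        $txt = preg_replace($pattern, $full, $txt);
    }
    return $txt;
}

echo expand_shortcodes("hw r u? wht's your name", $shortcodes), "\n";
// how are you? what is your name
```

Words that merely contain a short code, such as "highway" or "your", are left alone because the match has to start and end at a word boundary.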
