php - Improving the speed and efficiency of this PHP spell checker

Question

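I built a simple PHP spell-checking and suggestion application that uses PHP's similar_text() and levenshtein() functions to compare input against words from a dictionary loaded into an array.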

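How it works: first I load the contents of the dictionary into an array.
I split the user's input into words and spell check each word.
I spell check by testing whether the word is in the dictionary array.
If it is, I echo a congratulations message and move on.
If it is not, I iterate through the dictionary array, comparing each dictionary word with the assumed misspelling.
If the input word (lower-cased and stripped of punctuation) is 90% or more similar to a word in the dictionary array, I copy that word from the dictionary array into a suggestions array.
If the 90%-or-higher similarity comparison finds no suggestions, I use levenshtein() for a more liberal comparison and add those results to the suggestions array.
Then I iterate through the suggestions array and echo each suggestion.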

I noticed that this runs slowly. Slow enough to notice. I would like to know how to improve the speed and efficiency of this spell checker.

Any and all changes, improvements, suggestions, and code are welcome and appreciated.

Here is the code:

<?php
    function addTo($line) {
        return strtolower(trim($line));
    }

    $words = array_map('addTo', file('dictionary.txt'));
    $words = array_unique($words);

    function checkSpelling($input, $words) {
        $suggestions = array();
        if (in_array($input, $words)) {
            echo "you spelled the word right!";
        }
        else {
            foreach($words as $word) {
                $percentageSimilarity = 0.0;
                $input = preg_replace('/[^a-z0-9 ]+/i', '', $input);
                similar_text(strtolower(trim($input)), strtolower(trim($word)), $percentageSimilarity);
                if ($percentageSimilarity >= 90 && $percentageSimilarity<100) {
                    if(!in_array($word, $suggestions)){
                        array_push($suggestions, $word);
                    }
                }
            }
            if (empty($suggestions)) {
                foreach($words as $word) {
                    $input = preg_replace('/[^a-z0-9 ]+/i', '', $input);
                    $levenshtein = levenshtein(strtolower(trim($input)), strtolower(trim($word)));
                    if ($levenshtein <= 2 && $levenshtein>0) {
                        if(!in_array($word, $suggestions)) {
                            array_push($suggestions, $word);
                        }
                    }
                }
            }
            echo "Looks like you spelled that wrong. Here are some suggestions: <br />";
            foreach($suggestions as $suggestion) {
                echo "<br />".$suggestion."<br />";
            }
        }
    }

    if (isset($_GET['check'])) {
        $input = trim($_GET['check']);
        $sentence = '';
        if (stripos($input, ' ') !== false) {
            $sentence = explode(' ', $input);
            foreach($sentence as $item){
                checkSpelling($item, $words);
            }
        }
        else {
            checkSpelling($input, $words);
        }
    }

?>

<!Doctype HTMl>
<html lang="en">
    <head>
        <meta charset="utf-8" />
        <title>Spell Check</title>
    </head>
    <body>
        <form method="get">
             <input type="text" name="check" autocomplete="off" autofocus />
        </form>
    </body>
 </html>

score 0 · Accepted Answer

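Levenshtein is going to be rather processor-intensive on a large list. Right now, if you typed "refrigerator" wrong, it would calculate the edit distance to "cat", "dog", and "pimple".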

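Before going into the levenshtein loop, you can match the input against a precomputed metaphone or soundex key for each dictionary entry. This gives you a much shorter list of likely suggestions, and you can then use levenshtein and similar_text to rank that short list of matches.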
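As a rough illustration of this pre-filtering idea, the dictionary can be grouped by metaphone key once, so that a lookup only touches the words that share the input's key. This is a minimal sketch under assumptions: `buildMetaphoneIndex` and `suggestFromIndex` are hypothetical names, the four-word list stands in for a real dictionary, and an exact key match is stricter than a similarity comparison between keys, so it may return fewer candidates.

```php
<?php
// Hypothetical helper: group dictionary words by their metaphone key.
function buildMetaphoneIndex(array $words) {
    $index = array();
    foreach ($words as $word) {
        $index[metaphone($word)][] = $word;
    }
    return $index;
}

// Hypothetical helper: fetch only the bucket that shares the input's key,
// then rank that short list by ascending edit distance.
function suggestFromIndex($input, array $index) {
    $key = metaphone($input);
    $candidates = isset($index[$key]) ? $index[$key] : array();
    usort($candidates, function ($a, $b) use ($input) {
        return levenshtein($input, $a) - levenshtein($input, $b);
    });
    return $candidates;
}

// Tiny stand-in dictionary (an assumption for illustration only).
$index = buildMetaphoneIndex(array('refrigerator', 'cat', 'dog', 'pimple'));
print_r(suggestFromIndex('refridgerator', $index));
```

With this layout, levenshtein() runs only on the words in one bucket instead of on every dictionary entry.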

Another thing that would help is caching your results. I would hazard a guess that most misspellings are common ones.
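A caching layer could look roughly like the sketch below. It assumes an in-memory array that lives for one request only; keeping results across requests would need a persistent store (a file or APCu, for example), which is left out here. `cachedSuggestions` and `$slowLookup` are hypothetical names, and the three-word list is a stand-in for the real dictionary scan.

```php
<?php
// Hypothetical wrapper: remembers the suggestions computed for each input
// so that a repeated misspelling skips the expensive dictionary scan.
function cachedSuggestions($input, $compute, array &$cache) {
    if (!array_key_exists($input, $cache)) {
        $cache[$input] = $compute($input);
    }
    return $cache[$input];
}

// Stand-in for the expensive lookup; $calls counts how often it really runs.
$calls = 0;
$slowLookup = function ($input) use (&$calls) {
    $calls++;
    $matches = array();
    foreach (array('their', 'there', 'three') as $word) {
        if (levenshtein($input, $word) <= 2) {
            $matches[] = $word;
        }
    }
    return $matches;
};

$cache = array();
$first = cachedSuggestions('thier', $slowLookup, $cache);
$second = cachedSuggestions('thier', $slowLookup, $cache); // served from $cache
```

The second call returns the stored result without invoking the lookup again.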

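The following implementation does not pare down the data efficiently, but it should give you some guidance on how to avoid computing the levenshtein distance against the entire dictionary for every word.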

The first thing you will want to do is append the metaphone result to each word entry. This is one way to do so:

<?php
$dict = fopen("dictionary-orig.txt", "r");
$keyedDict = fopen("dictionary.txt", "w");
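while (($line = fgets($dict)) !== false){
    $line = trim(strtolower($line));
    // skip blank lines so no empty entries are written
    if ($line === '') { continue; }
    // store each word next to its precomputed metaphone key
    fputcsv($keyedDict, array($line,metaphone($line)));
}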
fclose($dict);
fclose($keyedDict);
?>

In addition to that, you will need something that can read the dictionary into an array:

<?php
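function readDictionary($file){
    // build a word => metaphone key map from the keyed CSV dictionary
    $dict = fopen($file, "r");
    $words = array();
    while(($line = fgetcsv($dict)) !== false){
        // skip blank or malformed rows
        if(!isset($line[1])){
            continue;
        }
        $words[$line[0]] = $line[1];
    }
    fclose($dict);
    return $words;
}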
function checkSpelling($input, $words){
    // sanitize and lower-case the input so it can match the dictionary keys
    $input = strtolower(preg_replace('/[^a-z0-9 ]+/i', '', $input));
    if(array_key_exists($input, $words)){
        // spelled correctly: return nothing so the caller's is_array() check fails
        return;
    }
    // get the metaphone key for the input
    $inputkey = metaphone($input);
    echo $inputkey."<br/>";
    $suggestions = array();
    foreach($words as $word => $key){
        // get the similarity between the keys
        $percentageSimilarity = 0;
        similar_text($key, $inputkey, $percentageSimilarity);
        if($percentageSimilarity > 90){
            $suggestions[] = array($word, levenshtein($input, $word));
        }
    }
    // rank the suggestions by ascending edit distance
    usort($suggestions, "rankSuggestions");
    return $suggestions;
}
if(isset($_GET['check'])){
    $words = readDictionary("dictionary.txt");
    $input = trim($_GET['check']);
    $sentence='';
    $sentence = explode(' ', $input);
    print "Searching Words ".implode(",", $sentence);
    foreach($sentence as $item){
        $suggestionsArray = checkSpelling($item, $words);
        if (is_array($suggestionsArray)){
                echo $item, " not found, maybe you meant";
                var_dump($suggestionsArray);
        } else {
                echo "found $item";
        }
    }
}
function rankSuggestions($a, $b){
        return $a[1]-$b[1];
}
?>
<!doctype html>
<html lang="en">
    <head>
        <meta charset="utf-8" />
        <title>Spell Check</title>
    </head>
    <body>
        <form method="get">
             <input type="text" name="check" autocomplete="off" autofocus />
        </form>
    </body>
 </html>

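The simplest way to actually pare down the data would be to split the dictionary into multiple files, partitioned by something like the first character of the string: dict.a.txt, dict.b.txt, dict.c.txt, and so on.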
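The partitioning described above might be sketched as follows. This is only an illustration under assumptions: `partitionDictionary` and `loadPartition` are hypothetical names, the files hold plain words (not the word/metaphone CSV used earlier), and entries that do not start with a letter or digit are skipped. Note that a misspelling in the first character would look in the wrong partition.

```php
<?php
// Hypothetical helper: write one file per first character
// (dict.a.txt, dict.b.txt, ...) inside $dir.
function partitionDictionary(array $words, $dir) {
    $buckets = array();
    foreach ($words as $word) {
        $word = strtolower(trim($word));
        // assumption: only entries starting with a letter or digit are kept
        if ($word === '' || !ctype_alnum($word[0])) {
            continue;
        }
        $buckets[$word[0]][] = $word;
    }
    foreach ($buckets as $letter => $bucket) {
        file_put_contents("$dir/dict.$letter.txt", implode("\n", $bucket) . "\n");
    }
}

// Hypothetical helper: load only the partition for the input's first character.
function loadPartition($input, $dir) {
    $input = strtolower(trim($input));
    if ($input === '' || !ctype_alnum($input[0])) {
        return array();
    }
    $file = "$dir/dict." . $input[0] . ".txt";
    if (!is_file($file)) {
        return array();
    }
    return file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES);
}

// Demo in a throwaway temp directory with a stand-in word list.
$dir = sys_get_temp_dir() . '/dict_parts_' . uniqid();
mkdir($dir);
partitionDictionary(array('apple', 'avocado', 'banana', 'cherry'), $dir);
$aWords = loadPartition('aple', $dir); // only the "a" file is read
```

Each spell check then compares the input against one partition rather than the whole dictionary.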
