php - How can I find the largest common substring between two strings in PHP?

Question

Is there a fast algorithm for finding the largest common substring of two strings, or is this an NP-complete problem?

In PHP I can find a needle in a haystack:

<?php

if (strstr("there is a needle in a haystack", "needle")) {
    echo "found<br>\n";
}
?>

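I guess I could do this in a loop over one of the strings, but that would be very expensive! Especially since my application of this is to search through a database of email and look for spam (i.e. similar emails sent by the same person).

Does anyone have any PHP code they can throw out there?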

score 10 · Accepted Answer

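The similar_text function may be what you want.

It calculates the similarity between two strings and returns the number of matching characters in both strings.

You may also want to look at levenshtein.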
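
As a minimal sketch of how these two built-ins behave (the sample strings are illustrative only, not taken from the question): similar_text returns the count of matching characters and can also report a percentage through its optional third argument, while levenshtein returns an edit distance, so lower means more alike.

```php
<?php
// similar_text() returns the number of matching characters and, through the
// optional third argument (passed by reference), a similarity percentage.
$matching = similar_text("World", "Word", $percent);
printf("%d matching characters, %.1f%% similar\n", $matching, $percent);

// levenshtein() returns the minimum number of single-character insertions,
// deletions and substitutions needed to turn one string into the other.
$distance = levenshtein("kitten", "sitting");
echo "edit distance: $distance\n";
```

Neither function returns the common substring itself; they only give a similarity measure, which may be enough for flagging near-duplicate emails.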

score 7 · Accepted Answer

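"Especially since my application of this is to search through a database of email and look for spam (i.e. similar emails sent by the same person)."

I think you should look at Bayesian spam inference algorithms, not necessarily longest common substring.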

http://www.devshed.com/c/a/PHP/Implement-Bayesian-inference-using-PHP-Part-1/

score 6 · Accepted Answer

I just wrote a function that finds the longest substring of str1 that also exists in str2

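// Written as a static method, so it belongs inside a class definition.
public static function getLongestMatchingSubstring($str1, $str2)
{
    $len_1 = strlen($str1);
    $longest = '';
    for ($i = 0; $i < $len_1; $i++) {
        // Try the longest candidate starting at $i first, and stop as soon as
        // the remaining candidates can no longer beat the current best.
        for ($j = $len_1 - $i; $j > strlen($longest); $j--) {
            $sub = substr($str1, $i, $j);
            if (strpos($str2, $sub) !== false) {
                $longest = $sub;
                break;
            }
        }
    }
    return $longest;
}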

score 4 · Accepted Answer

Late to the party, but here is a way to find the largest common substring in an array of strings:

Example:

$array = array(
    'PTT757LP4',
    'PTT757A',
    'PCT757B',
    'PCT757LP4EV'
);
echo longest_common_substring($array); // => t757 (the function lowercases its input)

Function:

function longest_common_substring($words) {
    $words = array_map('strtolower', array_map('trim', $words));
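    // Sort shortest first, breaking ties alphabetically.
    // (A closure is used because create_function() was removed in PHP 8.0.)
    $sort_by_strlen = function ($a, $b) {
        if (strlen($a) == strlen($b)) { return strcmp($a, $b); }
        return (strlen($a) < strlen($b)) ? -1 : 1;
    };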
    usort($words, $sort_by_strlen);
    // We have to assume that each string has something in common with the first
    // string (post sort), we just need to figure out what the longest common
    // string is. If any string DOES NOT have something in common with the first
    // string, the result is an empty string.
    $longest_common_substring = array();
    $shortest_string = str_split(array_shift($words));

    while (sizeof($shortest_string)) {
        array_unshift($longest_common_substring, '');
        foreach ($shortest_string as $ci => $char) {
            foreach ($words as $wi => $word) {
                // strpos() === false is used because strstr() can return a falsy string such as "0"
                if (strpos($word, $longest_common_substring[0] . $char) === false) {
                    // No match
                    break 2;
                } // if
            } // foreach
            // we found the current char in each word, so add it to the first longest_common_substring element,
            // then start checking again using the next char as well
            $longest_common_substring[0].= $char;
        } // foreach
        // We've finished looping through the entire shortest_string.
        // Remove the first char and start all over. Do this until there are no more
        // chars to search on.
        array_shift($shortest_string);
    }
    // If we made it here then we've run through everything
    usort($longest_common_substring, $sort_by_strlen);
    return array_pop($longest_common_substring);
}

I wrote a bit about it on my blog:

Find the longest common substring using PHP (24 Feb 2011)

score 3 · Accepted Answer

I have since found a relevant Wikipedia article. This is not an NP-complete problem; it can be solved in O(mn) time with a dynamic programming algorithm.
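
For illustration, here is a rough sketch of that dynamic programming approach in PHP. The function name longestCommonSubstringDp is made up for this example, and it works on bytes rather than multibyte characters. It keeps only two rows of the table, so memory stays at O(n) while the running time is O(mn).

```php
<?php
// Hypothetical helper name; not part of PHP's standard library.
// Compares bytes, so multibyte (e.g. UTF-8) text would need mb_* handling.
function longestCommonSubstringDp(string $a, string $b): string
{
    $m = strlen($a);
    $n = strlen($b);
    $bestLen = 0; // length of the longest match found so far
    $bestEnd = 0; // offset in $a just past the end of that match

    // $prev[$j] is the length of the longest common suffix of the first
    // $i - 1 bytes of $a and the first $j bytes of $b (previous table row).
    $prev = array_fill(0, $n + 1, 0);
    for ($i = 1; $i <= $m; $i++) {
        $curr = array_fill(0, $n + 1, 0);
        for ($j = 1; $j <= $n; $j++) {
            if ($a[$i - 1] === $b[$j - 1]) {
                // Matching bytes extend the common suffix ending one step back.
                $curr[$j] = $prev[$j - 1] + 1;
                if ($curr[$j] > $bestLen) {
                    $bestLen = $curr[$j];
                    $bestEnd = $i;
                }
            }
        }
        $prev = $curr;
    }
    return substr($a, $bestEnd - $bestLen, $bestLen);
}

echo longestCommonSubstringDp('PTT757LP4', 'PCT757B'), "\n"; // prints "T757"
```

When several substrings tie for the longest length, this sketch returns the one that ends earliest in the first string.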

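In PHP I found the similar_text function very useful. Here is a code sample that retrieves a series of text emails, loops through them, and finds the ones that are more than 90% similar to each other. NOTE: something like this is NOT scalable: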

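<?php
// Note: the mysql_* extension was removed in PHP 7.0. On current PHP, use
// mysqli or PDO with a prepared statement instead of interpolating $someUserID.
// Gather all messages by a user into two identical arrays
$getMsgsRes = mysql_query("SELECT * FROM email_messages WHERE `from` = '$someUserID'");
$msgsInfo1 = array();
$msgsInfo2 = array();
while ($msgInfo = mysql_fetch_assoc($getMsgsRes))
{
    $msgsInfo1[] = $msgInfo;
    $msgsInfo2[] = $msgInfo;
}

// Loop over msgs and compare each one to every other
foreach ($msgsInfo1 as $msg1) {
    foreach ($msgsInfo2 as $msg2) {
        if ($msg1['msgID'] == $msg2['msgID']) {
            continue; // skip comparing a message with itself
        }
        // similar_text() writes the similarity percentage into its third argument
        similar_text($msg1['msgTxt'], $msg2['msgTxt'], $similarity_pst);
        if ($similarity_pst > 90) {
            echo "{$msg1['msgID']} is {$similarity_pst}% similar to {$msg2['msgID']}\n";
        }
    }
}
?>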
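
On the scalability point, one small saving is to compare each unordered pair only once instead of looping over two full copies of the list. The sketch below works under some assumptions: findSimilarPairs and the 'id' / 'text' keys are made-up names, and the messages sit in a plain zero-indexed in-memory array rather than coming from the database. It is still quadratic in the number of messages, and because similar_text can give slightly different results when its arguments are swapped, it is not an exact replacement for the double loop.

```php
<?php
// Hypothetical helper; $messages must be a zero-indexed list of rows with
// illustrative 'id' and 'text' keys.
function findSimilarPairs(array $messages, float $threshold = 90.0): array
{
    $pairs = [];
    $count = count($messages);
    for ($i = 0; $i < $count; $i++) {
        // Start at $i + 1 so each unordered pair is compared exactly once.
        for ($j = $i + 1; $j < $count; $j++) {
            similar_text($messages[$i]['text'], $messages[$j]['text'], $percent);
            if ($percent > $threshold) {
                $pairs[] = [$messages[$i]['id'], $messages[$j]['id'], $percent];
            }
        }
    }
    return $pairs;
}

$messages = [
    ['id' => 1, 'text' => 'Buy cheap pills now'],
    ['id' => 2, 'text' => 'Buy cheap pills now!'],
    ['id' => 3, 'text' => 'Meeting at noon'],
];
foreach (findSimilarPairs($messages) as [$id1, $id2, $percent]) {
    printf("%d is %.1f%% similar to %d\n", $id1, $percent, $id2);
}
```

With the sample data only messages 1 and 2 are reported as a pair.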

score 1 · Accepted Answer

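Have a look at Algorithm Implementation/Strings/Longest common substring on Wikibooks. I haven't tested the PHP implementation, but it appears to match the general algorithm on the Wikipedia page.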
