php - Counting unique occurrences of substrings in a word list without knowing the substrings?

Question

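*I am trying to count the unique occurrences of substrings in a word list.* That means going through the word list, detecting whether any substring of at least some minimum number of characters occurs in more than one word, and counting those occurrences. I do not know any of the substrings in advance.

There is a working solution for the case where you know the substring, but what if you don't? All that is known is the minimum number of characters the words are based on.

That solution finds every word in which "Book" is a substring, using a PHP function.

Desired result: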

book count (5)
stor count (2)

score 1 · Accepted Answer

Here is my first approximation: unfinished, untested, with at least 1 bug, and written in Eiffel. Well, I am not going to do all the work for you.

deferred class
    SUBSTRING_COUNT
feature
    threshold: INTEGER_32 = 5
            -- A shared substring must be longer than this to be recorded.

    biggest_starting_substring_length (a, b: STRING): INTEGER_32
            -- Presumably the length of the longest prefix of `a' that also occurs in `b'.
        deferred
        end

    biggest_starting_substring (a, b: STRING): STRING
            -- Prefix of `a' of the length computed above.
        do
                -- Eiffel strings are indexed from 1.
            Result := a.substring (1, biggest_starting_substring_length (a, b))
        end

    make_list_of_substrings (a, b: STRING)
            -- Extend `list' with every long enough shared substring found in `a'.
        local
            index: INTEGER_32
            this_one: STRING
        do
            from
                index := 1
            invariant
                index >= 1 and index <= a.count + 1
            until
                index > a.count
            loop
                this_one := biggest_starting_substring (a.substring (index, a.count), b)
                if this_one.count > threshold then
                    list.extend (this_one)
                end
                index := index + 1
            variant
                a.count - index + 1
            end
        end

    list: ARRAYED_LIST [STRING]

end

score 1 · Accepted Answer

Given a string of length 100

book bookstore bookworm booking book cooking boring bookingservice.... ok
0123456789...                                                     ... 100

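your algorithm could be:

Examine substrings from different starting points and of different lengths. Start at 0 and take all substrings of length 1-100, so 0-1, 0-2, 0-3, ..., and see whether any of those substrings occurs more than once in the whole string. Then walk through the string from increasing positions, searching all substrings that start at 1, i.e. 1-2, 1-3, 1-4, ... and so on, until you reach 99-100.

Keep a table of all substrings and their number of occurrences, which you can then sort.

You can optimize by specifying a minimum and maximum length, which considerably reduces the number of searches and hits. Also, once you have found a substring, keep it in a set of substrings already searched. If you encounter that substring again, skip it. (E.g. for book: the hits you have already counted should not be counted again when you reach the next occurrence of book.) Also, you never need to search for strings longer than half of the total string.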
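The steps above can be sketched roughly as follows. This is only an illustration, not the answerer's code: the function name `countRepeatedSubstrings`, its parameters, and the short test string are assumptions, and it counts case-sensitive, non-overlapping occurrences with `substr_count`.

```php
<?php
// Sketch only: countRepeatedSubstrings() and its parameters are hypothetical names.
function countRepeatedSubstrings(string $text, int $min = 3, int $max = 16): array
{
    $n = strlen($text);
    // A substring longer than half the text cannot occur twice without overlapping.
    $max = min($max, intdiv($n, 2));
    $counts = [];

    // Try every start position and every length between $min and $max.
    for ($start = 0; $start < $n; $start++) {
        for ($length = $min; $length <= $max && $start + $length <= $n; $length++) {
            $sub = substr($text, $start, $length);
            // Already counted: skip it (this is the "set of searched substrings").
            if (isset($counts[$sub])) {
                continue;
            }
            $counts[$sub] = substr_count($text, $sub);
        }
    }

    // Keep only substrings that occur more than once, most frequent first.
    $counts = array_filter($counts, fn ($c) => $c > 1);
    arsort($counts);
    return $counts;
}

$counts = countRepeatedSubstrings('book bookstore bookworm booking book cooking');
echo $counts['book'], "\n"; // prints 5
```

Because every candidate is looked up in `$counts` before it is searched, each distinct substring is counted only once, however often the scan runs into it.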

For the example string, you might run an additional test for the uniqueness of the strings. You would get

o              x ..
oo             x  7
bo             x  7
ok             x  6 
book           x  5
booking        x  2
bookingservice x  1

Ignoring strings shorter than 3 characters (and those longer than half of the total text string), you get

book           x  5
booking        x  2
bookingservice x  1
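As a small illustration of that filtering step, the sketch below drops table entries by key length. The counts are copied from the table above (the row without a number is left out); the variable names and the cutoff of 3 are assumptions taken from the text, not part of the original answer.

```php
<?php
// Illustrative counts copied from the table above.
$table = [
    'oo' => 7, 'bo' => 7, 'ok' => 6,
    'book' => 5, 'booking' => 2, 'bookingservice' => 1,
];
$textLength = 100; // length of the example string
$minLength = 3;    // assumed cutoff, as stated in the text

// Keep entries whose key is at least $minLength long
// and at most half as long as the whole text.
$filtered = array_filter(
    $table,
    fn ($substring) => strlen($substring) >= $minLength
        && strlen($substring) <= intdiv($textLength, 2),
    ARRAY_FILTER_USE_KEY
);
print_r($filtered);
```

Only `book`, `booking` and `bookingservice` remain, matching the reduced table.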

That is already a fairly reasonable result.

[Edit] This obviously looks at all strings, not just natural words.
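If only natural words matter, one possible variation is to split the input into words first and count in how many words each substring appears, so that a word contributes at most once per substring. The sketch below is an assumption about how that could look: `countWordsPerSubstring`, the minimum length of 4, and the sample word list are all made up for illustration.

```php
<?php
// Sketch only: countWordsPerSubstring() is a hypothetical helper name.
function countWordsPerSubstring(array $words, int $min = 4): array
{
    $counts = [];
    foreach ($words as $word) {
        $word = strtolower($word);
        $n = strlen($word);

        // Collect the distinct substrings of this word (length >= $min).
        $seenInWord = [];
        for ($start = 0; $start + $min <= $n; $start++) {
            for ($length = $min; $start + $length <= $n; $length++) {
                $seenInWord[substr($word, $start, $length)] = true;
            }
        }

        // Each word contributes at most once per substring.
        foreach (array_keys($seenInWord) as $sub) {
            $counts[$sub] = ($counts[$sub] ?? 0) + 1;
        }
    }

    // Keep substrings shared by at least two words, most frequent first.
    $counts = array_filter($counts, fn ($c) => $c > 1);
    arsort($counts);
    return $counts;
}

// Made-up word list for the demonstration.
$words = ['book', 'bookstore', 'bookworm', 'booking', 'storage'];
$counts = countWordsPerSubstring($words);
echo $counts['book'], ' ', $counts['stor'], "\n"; // prints "4 2"
```

Substrings never span two words here, so hits containing a space cannot occur.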

[Edit] Normally I do not like writing code for the OP, but in this case I got a bit interested myself:

<?php
$string = "book bookshelf booking foobar bar booking ";
$string .= "selfservice bookingservice cooking";

function search($string, $min = 4, $max = 16, $threshhold = 2) {
    echo "<pre><br/>";
    echo "searching <em>'$string'</em> for string occurances ";
    echo "of length $min - $max: <br/>";

    $hits = array();

    // no string longer than half of the total string will be found twice
    if ($max > strlen($string) / 2) {
        $max = intdiv(strlen($string), 2);
    }

    // examine substrings:
    // start from 0, 1, 2, ... (as written, only offsets below $max are scanned)
    for ($start = 0; $start < $max; $start++) {

        // and string length $min, $min + 1, ... $max
        for ($length = $min; $length <= $max; $length++) {

            // get the substring in question,
            // but search for natural words (trim);
            // trimming can leave a substring shorter than $min
            $substring = trim(substr($string, $start, $length));
            if ($substring === '') {
                continue;
            }

            // if substring was not counted yet,
            // add the found count to the hits
            if (!isset($hits[$substring])) {
                // escape regex metacharacters before matching
                $pattern = '/' . preg_quote($substring, '/') . '/i';
                preg_match_all($pattern, $string, $matches);
                $hits[$substring] = count($matches[0]);
            }
        }
    }

    // sort the hits array desc by number of hits
    arsort($hits);

    // remove substring hits with fewer hits than threshhold
    foreach ($hits as $substring => $count) {
        if ($count < $threshhold) {
            unset($hits[$substring]);
        }
    }

    print_r($hits);
}

search($string);

?>

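The comments and variable names should make the code self-explanatory. In your case, $string would come from reading a file. This example will output: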

searching 'book bookshelf booking foobar bar booking selfservice 
bookingservice cooking' for string occurances of length 4 - 16: 
Array
(
    [ook] => 6
    [book] => 5
    [boo] => 5
    [bookin] => 3
    [booking] => 3
    [booki] => 3
    [elf] => 2
)
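Since the answer suggests that $string would be read from a file, here is a minimal sketch of that step. The helper name `loadWordString` and the temporary demo file are assumptions; the idea is only to collapse a word list with one word per line (or any other whitespace) into the space-separated form used above.

```php
<?php
// Sketch only: loadWordString() and the demo file are hypothetical.
function loadWordString(string $path): string
{
    $raw = file_get_contents($path);
    if ($raw === false) {
        throw new RuntimeException("Cannot read $path");
    }
    // Collapse newlines, tabs and repeated spaces into single spaces.
    return trim(preg_replace('/\s+/', ' ', $raw));
}

// Demo with a temporary file standing in for a real word list.
$path = tempnam(sys_get_temp_dir(), 'words');
file_put_contents($path, "book\nbookshelf\n\nbooking  foobar\n");
$string = loadWordString($path);
unlink($path);
echo $string, "\n"; // prints "book bookshelf booking foobar"
```

The resulting string can then be passed to search() in place of the hard-coded example.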

Let me know how you implement it :)
