python - 在python中查找字符串列表的modal子字符串

Question

找到一个共同的子串已经在许多问题中得到解答，即给定一个字符串列表，找到对所有字符串都通用的（通常是最长或最多的单词）子串。看这里：

我的问题是，如何在字符串列表中找到最长的模态子字符串？这里重要的规定是这个子字符串不一定要出现在列表中的所有字符串中。

这里的科学有一点艺术性，因为明显的权衡是在 1）您希望子字符串出现在多少个字符串中？和 2) 你希望子字符串有多长？为了解决这个问题，让我们假设我们希望所需的子字符串包含三个单词（如果此处出现平局，则取最长的字符串，然后是第一个实例）。

所以给定清单，

mylist = ["hey how's it going?", "I'm fine  thanks.", "Did you get that thing?", "Of course I got that thing, that's why I asked you: how's it going?"]

所需的输出是，

"how's it going"

如果规定是两个字长，那么所需的输出将是，

"that thing"

因为“that thing”是一个比“how's it”或“it going”更长的字符串

代码中的上述答案分别是三个和两个单词长的模态子字符串。

编辑：

由于对此有赏金，我将更具体地说明模态子字符串是什么。

模态子串：对于子串中给定长度的单词（这是唯一标识模态子串所必需的），模态子串是列表中最大数量的字符串共有的子串。如果存在平局（即对于给定长度的子字符串中的三个单词，有两个候选子字符串都出现在 80% 的字符串中），则应使用具有最长字符长度的子字符串。如果在那之后仍然存在平局（这应该不太可能但很好解释），那么只需选择第一个或随机选择。

一个好的答案将有一个函数，该函数返回子字符串中给定数量的单词的模态子字符串（其中单词的数量可以是任意数字）。

一个令人难以置信的答案将免除“给定单词数”的限制，而是包含一个标量（例如 \alpha），它管理子字符串长度（以单词为单位）和它出现在列表中的次数之间的权衡。接近 1 的 Alpha 将选择一个非常长（以单词表示）但不一定在列表中出现多次的模态子字符串。接近 0 的 Alpha 将选择在列表中出现尽可能多且不关心子字符串长度的模态子字符串。不过，我并没有真正期待这一点，并且会接受一个回答原始问题的答案。

score 2 · Accepted Answer

查找给定长度的模态子串

我们的输入如下：

mylist = ["hey how's it going?", "I'm fine  thanks.", "Did you get that thing?", "Of course I got that thing, that's why I asked you: how's it going?"]

首先，我们将定义一个预处理函数，以清理多余的空格和标点符号。这使得从句子中提取单词变得更容易：

def preprocess(strings):
    # used this answer to get rid of extra whitespaces: https://stackoverflow.com/a/1546251/6735980
    return [" ".join(string.split()).replace(",", "").replace(".", "").replace("?", "") for string in strings]

我们还将定义一个辅助函数来查找n -grams（一个句子中n 个单词的子字符串）：

def find_n_grams(string, n):
    words = string.split(" ")
    n_grams = []

    for i in range(len(words) - n + 1):
        n_grams.append(" ".join([words[i + idx] for idx in range(n)]))

    return n_grams

然后，我们可以使用以下函数为给定数量的单词找到问题中定义的“模态子字符串”。我不确定这是否是计算它的最佳/有效方法，但它可以完成工作：

def find_modal_substring(strings, num_words):
    n_grams_per_string = [find_n_grams(string, num_words) for string in strings]
    max_num_occurences = 0
    modal_substring = None

    for i in range(len(strings)):
        n_grams = n_grams_per_string[i]

        for n_gram in n_grams:
            num_occurences = 1

            for j in range(i + 1, len(strings)):
                if n_gram in n_grams_per_string[j]:
                    num_occurences += 1

            if num_occurences > max_num_occurences:
                max_num_occurences = num_occurences
                modal_substring = n_gram
            elif num_occurences == max_num_occurences and len(modal_substring) < len(n_gram):
                max_num_occurences = num_occurences
                modal_substring = n_gram

    return modal_substring

一些输入/输出对：

> print(find_modal_substring(preprocess(mylist), 3))

怎么样了

> print(find_modal_substring(preprocess(mylist), 2))

那个东西

使用 alpha 参数查找模态子字符串

这部分问题中最困难的事情是如何在给定 alpha 参数的情况下定义得分函数。我们知道：

接近 1 的 Alpha 将选择一个非常长（以单词表示）但不一定在列表中出现多次的模态子字符串。接近 0 的 Alpha 将选择在列表中出现尽可能多且不关心子字符串长度的模态子字符串。

以下是满足这一点的一个得分函数，但我们可能还可以想到许多其他得分函数。通过将它隔离在一个函数中，任何人都可以根据自己的需要轻松修改它：

def compute_score(n_gram, occurences, alpha):
    return alpha * len(n_gram.split(" ")) + (1.0 - alpha) * occurences

鉴于此，查找基于 alpha 的模态子字符串的函数相当长，因为它首先需要为所有n找到所有可能的n -gram ，但我们开始：

def find_modal_substring_alpha(strings, alpha):
    n_grams_per_n_per_string = []
    n = 1

    while True:
        n_grams_per_string = [find_n_grams(string, n) for string in strings]

        if all(n_grams == [] for n_grams in n_grams_per_string):
            break
        else:
            n_grams_per_n_per_string.append(n_grams_per_string)

        n += 1

    best_modal_substring = None
    best_score = 0.0

    for n_grams_per_string in n_grams_per_n_per_string:
        for i in range(len(strings)):
            n_grams = n_grams_per_string[i]

            for n_gram in n_grams:
                num_occurences = 1

                for j in range(i + 1, len(strings)):
                    if n_gram in n_grams_per_string[j]:
                        num_occurences += 1

                score = compute_score(n_gram, num_occurences, alpha)

                if score > best_score:
                    best_score = score
                    best_modal_substring = n_gram
                elif score == best_score and len(best_modal_substring) < len(n_gram):
                    best_score = score
                    best_modal_substring = n_gram

    return best_modal_substring

事实证明，在这里调整 alpha 有点困难。即使对于相对较低的 alpha ( 0.3)，我们得到的结果与我们得到的完全相同1.0：

> print(find_modal_substring_alpha(preprocess(mylist), 0.3))

我当然知道了，所以我问你：最近怎么样

如果我们将 alpha 一直向下移动到0.01例如，我们得到：

> print(find_modal_substring_alpha(preprocess(mylist), 0.01))

怎么样了

python - 在python中查找字符串列表的*modal*子字符串

1 回答 1

查找给定长度的模态子串

使用 alpha 参数查找模态子字符串

Related

Reference

python - 在python中查找字符串列表的modal子字符串