python - 字符串的子序列

Question

我实现了一个 python 函数，它返回 2 个字符串的最长公共子序列。现在，我想实现一个返回任意数量字符串的最长公共子序列的函数。

我找到了 3 个字符串的帮助：

dp[i, j, k] = / 1 + dp[i - 1, j - 1, k - 1] if A[i] = B[j] = C[k]
              \ max(dp[i - 1, j, k], dp[i, j - 1, k], dp[i, j, k - 1]) otherwise

但我真的不明白这个提示。因此，如果有人可以帮助我，我将不胜感激。最好的问候，马克

score 1 · Accepted Answer

您可以O(N log(N))通过执行类似于带有滚动哈希的二进制搜索的操作来及时执行此操作（其中 N 是序列的组合长度）。

请注意，最长可能公共序列的长度是最小序列的长度，smallestLength。进行如下操作：

初始化：

假设最长公共子序列（我们称之为a）的长度为a=smallestLength/2

算法：

iteration_number += 1
扫描所有列表（如果需要，可以并行！）并执行滚动哈希；这将为每个列表生成 len(list)-(a-1) 哈希
将所有哈希插入到一个集合数据结构中（每个列表一个集合）以实现 O(1) 查找时间
检查是否有任何哈希冲突（取所有集合的交集）：如果有一个或多个冲突，手动确认a在这些位置有一个 -length 子序列公共子序列，因为哈希可能是错误的（尽管如果您选择足够精细的哈希，这在实践中永远不会发生）
你找到共享序列了吗？
- 如果找到这样的序列，请重复上述步骤，但增加假定长度，就像在二分搜索中一样（添加smallestlength/2**iteration_number）
- 如果您没有找到这样的序列，请重复上述步骤，但减少假设的长度，就像在二分搜索中一样（减法smallestlength/2**iteration_number）

score 1 · Accepted Answer

对于无限数量的字符串 n，LCS 问题是 NPComplete。意思是，没有已知的多项式算法能够解决这个问题。也意味着您可以放弃 DP 解决方案：p

这是一个启发式方法的链接，用于近似多个字符串的 LCS。

http://www.aaai.org/ocs/index.php/AAAI/AAAI10/paper/download/1559/2197

score 0 · Accepted Answer

仅供参考：

from difflib import SequenceMatcher


def lcs_of_2(a, b):
    """
    get longest common string
    :param a:
    :param b:
    :return:
    """
    match = SequenceMatcher(None, a, b).find_longest_match(0, len(a), 0, len(b))
    return a[match[0]: match[0] + match[2]]


def lcs_of_list(*args):
    """
    get longest common string of list
    :param args:
    :return:
    """
    if len(args) == 2:
        return lcs_of_2(args[0], args[1])
    first = args[0]
    remains = args[1:]
    return lcs_of_2(first, lcs_of_list(*remains))


if __name__ == '__main__':
    a = 'abcdef'
    b = 'abbbbabcdeffff'
    c = 'bcdefff'
    print(lcs_of_list(a, b, c))

结果：

bcdef

score 0 · Accepted Answer

python3中的一个解决方案是：

#Uses python3
import sys
def calc_cache_pos(strings, indexes):
    factor = 1
    pos = 0
    for s, i in zip(strings, indexes):
        pos += i * factor
        factor *= len(s)
    return pos

def lcs_back(strings, indexes, cache):
    if -1 in indexes:
        return ""
    match = all(strings[0][indexes[0]] == s[i]
                for s, i in zip(strings, indexes))
    if match:
        new_indexes = [i - 1 for i in indexes]
        result = lcs_back(strings, new_indexes, cache) + strings[0][indexes[0]]
    else:
        substrings = [""] * len(strings)
        for n in range(len(strings)):
            if indexes[n] > 0:
                new_indexes = indexes[:]
                new_indexes[n] -= 1
                cache_pos = calc_cache_pos(strings, new_indexes)
                if cache[cache_pos] is None:
                    substrings[n] = lcs_back(strings, new_indexes, cache)
                else:
                    substrings[n] = cache[cache_pos]
        result = max(substrings, key=len)
    cache[calc_cache_pos(strings, indexes)] = result
    return result

def lcs(strings):
        if len(strings) == 0:
        return ""
    elif len(strings) == 1:
        return strings[0]
    else:
        cache_size = 1
        for s in strings:
            cache_size *= len(s)
        cache = [None] * cache_size
        indexes = [len(s) - 1 for s in strings]
        return (lcs_back(strings, indexes, cache))
if __name__ == '__main__':
    input = sys.stdin.read()
    data = list(map(int, input.split()))
    an = data[0]
    data = data[1:]
    a1 = data[:an]
    data = data[an:]
    bn = data[0]
    data = data[1:]
    b1 = data[:bn]
    data = data[bn:]
    cn = data[0]
    data = data[1:]
    c1 = data[:cn]
    a = ''
    for i in a1:
        a = a + i
    b = ''
    for i in b1:
        b = b + i
    c = ''
    for i in c1:
        c = c + i
    print(lcs([a, b, c]))

这会读取 3+ 个数组的输入，每个字符后跟一个空格。在每个数组之前输入数组的大小。输入将是

输入：

8

阿巴克达布

6

bdcaba

6

巴卡

输出：

咩

python - 字符串的子序列

4 回答 4

Related

Reference