string - 最长匹配子串，与字符顺序无关

Question

我找不到以下假设面试问题的答案：

给定两个长度为 N 的字符串序列，无论顺序如何，如何找到匹配子串的最大长度。

例如，给定seq1 = "ABCDEFG", 和seq2 = "DBCAPFG"，最大长度窗口为 4。（ABCDfromseq1和DBCAfrom seq2）。

score 8 · Accepted Answer

这是一个 O(n) 解决方案（假设固定字母大小和 O(1) 字典访问）。

使用单个频率表，对 seq1 中的字符进行向上计数，对 seq2 中的字符进行向下计数。如果此直方图再次采用相同的值，我们将有一个匹配窗口（因为这意味着我们必须有相同数量的中间字符）。

所以我们使用字典来存储直方图的先前值。

Python实现：

def find_window(seq1,seq2):
    """Yield length of matching windows"""
    D={} # Dictionary with key=histogram, value = first position where this happens
    H=[0]*26 # H[x] is count of x in seq1 - count of x in seq2
    D[tuple(H)]=-1
    for i,(a,b) in enumerate(zip(seq1,seq2)):
        a=ord(a)-ord('A')
        b=ord(b)-ord('A') 
        H[a]+=1
        H[b]-=1
        key=tuple(H)
        if key in D:
            yield i-D[key]
        if key not in D:
            D[key]=i

print max(find_window("ABCDEFG","DBCAPFG"))

如果你有一个更大的字母表，你可以使用类似的技术，只存储一个哈希值而不是完整的直方图。例如，您可以简单地为每个字符选择一个随机代码，然后添加 seq 中每个字母的代码，或者减去 seq2 中的字母。

如果发生哈希冲突，您将需要第二遍检查建议的匹配是否正确。

score 0 · Accepted Answer

假设

对于给定的数组 A[0..n], B[0..m]

有限字母表

不重复的字符

 ans = -1
 j = index of A[0] in B
 if j > n then no substring exists
 for 1 .. n
     find A[i] in B[0..j]
     if not found
         j = find A[i] in B[j+1..m]
         if not found, return last ans
     else 
         if i == j then ans = j

如果 B[0..j] 已排序（可能构造二叉树 P），则在 B[0..j] 中搜索 A[i] 将是 O(log n) 并且整个算法将是 O(n log n)

帮助验证这一点会很棒。
有反例吗？

score 0 · Accepted Answer

这肯定不是最佳算法，但没有人提供功能齐全的解决方案，所以这里有一个简单的方法：

def sort_possible_substrings(p_str):
    """The function iterates through every possible substring of p_str and 
    sorts the characters within the substring and finally inserts them in 
    a set which will be returned."""

    S = set()
    # the possible length of the substrings of p_str
    for s_len in range(1, len(p_str)+1):
        # the substrings of p_str with s_len length
        for s in range(len(p_str)-(s_len-1)):
            sorted_substr = ''.join(sorted(p_str[s:s+s_len]))
            S.add(sorted_substr)
    return S


def longest_common_perm_len(str1, str2):
    """Build up a set from the substrings of the shorter string with the help of 
    sort_possible_substrings. Iterate through the possible substrings of the
    longer string from the longest to the shortest and check whether the same
    string is contained by "candidates" thus by the shorter string."""
    shorter, longer = (str1,str2) if len(str1) <= len(str2) else (str2,str1)
    candidates = sort_possible_substrings(shorter)
    for win_len in range(len(shorter),0,-1):
        for s in range(len(longer)-(win_len-1)):
            str_chunk = ''.join(sorted(longer[s:s+win_len]))
            if str_chunk in candidates:
                return win_len
    return 0

（子字符串中字符的排序可以用 Peter de Rivaz 的“直方图”解决方案代替。）

这种不是最优的、直接的算法给出：

lsubstr.longest_common_perm_len("ABCDZZEFG", "XDBCAXFEG") -> 4
lsubstr.longest_common_perm_len("ABCD", "XDXACXB") -> 1

string - 最长匹配子串，与字符顺序无关

3 回答 3

Related

Reference