1

I am somewhat puzzled by a strange behaviour in the difflib library. I try to find overlapping sequences in strings (actually Fasta sequences from a Rosalind task) to glue them together. The code adapted from here works well with a smaller length of the string (for the sake of the clarity, I construct here an example from a common substring a):

import difflib

def glue(seq1, seq2):     
    s = difflib.SequenceMatcher(None, seq1, seq2)
    start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2)) 
    if start1 + overlap == len(seq1) and start2 == 0:
        return seq1 + seq2[overlap:]
    #no overlap found, return -1
    return -1


a = "ABCDEFG"
s1 = "XXX" + a
s2 = a + "YYY"

print(glue(s1, s2))

Output

XXXABCDEFGYYY

But when the string is longer, difflib doesn't find the match any more.

a = "AGGTGTGCCTGTGTCTATACATCGTACGCGGGAAGGTCCAAGTTAACATGGGGTACTGTAATGCACACGTACGCGGGAAGGTCCAAGTTAACTACGAAACGCGAGCCCATCTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCAAGGATCTTTGTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCCGGTGTTAACTTGCTGTCAGATGCGCGCCACGGCCAAATTCTAGGCACGCCAAATTCTAGGCACTTTAAGTGGTTCGATGATCCACGATGGTAAGCCAGCCGTACTTGC"

s1 = "XXX" + a
s2 = a + "YYY"

print(glue(s1, s2))

Output

-1

Why does this happen and how can you use difflib with longer strings?

4

1 回答 1

1

我注意到,当“ATCG”字符串超过 200 时,这种行为就开始了。在difflib文档页面上寻找这个数字让我看到了这段话:

自动垃圾启发式:SequenceMatcher 支持自动将某些序列项视为垃圾的启发式。启发式计算每个单独项目在序列中出现的次数。如果一个项目的重复项(在第一个之后)占序列的 1% 以上,并且序列长度至少为 200 个项目,则该项目被标记为“流行”并被视为垃圾,以进行序列匹配。在创建 SequenceMatcher 时,可以通过将 autojunk 参数设置为 False 来关闭此启发式方法。

由于序列仅包含ATCG,它们当然超过 200 个字符序列的 1%,因此被认为是垃圾。将代码更改为

import difflib

def glue(seq1, seq2):     
    s = difflib.SequenceMatcher(None, seq1, seq2, autojunk = False)
    start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2)) 
    if start1 + overlap == len(seq1) and start2 == 0:
        return seq1 + seq2[overlap:]
    return -1

摆脱这种行为并允许无限制地进行子字符串匹配。
我想我把它留在这里,因为我浪费了几个小时来梳理我的代码来找到这个问题。

于 2018-05-05T18:54:46.913 回答