I am somewhat puzzled by a strange behaviour in the difflib library. I am trying to find overlapping sequences in strings (actually FASTA sequences from a Rosalind task) to glue them together. The code adapted from here works well with shorter strings (for the sake of clarity, I construct an example here from a common substring a):
import difflib
def glue(seq1, seq2):
    s = difflib.SequenceMatcher(None, seq1, seq2)
    start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2))
    if start1 + overlap == len(seq1) and start2 == 0:
        return seq1 + seq2[overlap:]
    # no overlap found, return -1
    return -1
a = "ABCDEFG"
s1 = "XXX" + a
s2 = a + "YYY"
print(glue(s1, s2))
Output
XXXABCDEFGYYY
But when the string is longer, difflib doesn't find the match any more.
a = "AGGTGTGCCTGTGTCTATACATCGTACGCGGGAAGGTCCAAGTTAACATGGGGTACTGTAATGCACACGTACGCGGGAAGGTCCAAGTTAACTACGAAACGCGAGCCCATCTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCAAGGATCTTTGTTTGCCGGTGTTAACTTGCTGTCAGGTGTTTGGCCGGTGTTAACTTGCTGTCAGATGCGCGCCACGGCCAAATTCTAGGCACGCCAAATTCTAGGCACTTTAAGTGGTTCGATGATCCACGATGGTAAGCCAGCCGTACTTGC"
s1 = "XXX" + a
s2 = a + "YYY"
print(glue(s1, s2))
Output
-1
Why does this happen and how can you use difflib with longer strings?
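One possible explanation, as far as I can tell from the difflib documentation: SequenceMatcher has an autojunk heuristic that, once the second sequence is at least 200 items long, treats any item making up more than about 1% of it as "popular" and ignores it when searching for matches. With a four-letter DNA alphabet, every base would qualify. Below is a minimal sketch that assumes this is the cause and passes autojunk=False; the name glue_no_autojunk and the stand-in sequence are hypothetical, chosen only for illustration.

```python
import difflib

def glue_no_autojunk(seq1, seq2):
    # autojunk=False switches off the "popular element" heuristic
    # (assumed here to be the reason the long match is missed)
    s = difflib.SequenceMatcher(None, seq1, seq2, autojunk=False)
    start1, start2, overlap = s.find_longest_match(0, len(seq1), 0, len(seq2))
    # accept only a match that ends seq1 and starts seq2
    if start1 + overlap == len(seq1) and start2 == 0:
        return seq1 + seq2[overlap:]
    # no overlap found, return -1
    return -1

# Stand-in for the long sequence: 300 characters over a 4-letter alphabet
a = "ACGT" * 75
s1 = "XXX" + a
s2 = a + "YYY"
print(glue_no_autojunk(s1, s2) == "XXX" + a + "YYY")
```

Disabling the heuristic may make matching slower on very long inputs, so this is a trade-off rather than a free fix.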