I'm new to both Python and bioinformatics, but I'm working my way through the rosalind.info web site to learn some of both. You do this with a suffix tree. A suffix tree (see http://en.wikipedia.org/wiki/Suffix_tree) is the magical data structure that makes all things possible in bioinformatics. You quickly locate common substrings in multiple long sequences. Suffix trees only require linear time, so length 10,000,000 is feasible.
So first find the reverse complement of the sequence. Then put both into the suffix tree, and find the common substrings between them (of some minimum length).
The code below uses this suffix tree implementation: http://www.daimi.au.dk/~mailund/suffix_tree.html. It's written in C with Python bindings. It won't handle a large number of sequences, but two is no problem. However I can't say whether this will handle length 10,000,000.
from suffix_tree import GeneralisedSuffixTree
baseComplement = { 'A' : 'T', 'C' : 'G', 'G' : 'C', 'T' : 'A' }
def revc(seq):
return "".join([baseComplement[base] for base in seq[::-1]])
data = "AGGGTTTCCCTGACCTTCACTGCAGGTCATGCA"
# revc TGCATGACCTGCAGTGAAGGTCAGGGAAACCCT
# 012345678901234567890123456789012
# 1 2 3
minlength = 6
n = len(data)
tree = GeneralisedSuffixTree([data, revc(data)])
for shared in tree.sharedSubstrings(minlength):
#print shared
_, start, stop = shared[0]
seq = data[start:stop]
_, rstart, rstop = shared[1]
rseq = data[n-rstop:n-rstart]
print "Match: {0} at [{1}:{2}] and {3} at [{4}:{5}]".format(seq, start, stop, rseq, n-rstop, n-rstart)
This produces output
Match: AGGTCA at [23:29] and TGACCT at [10:16]
Match: TGACCT at [10:16] and AGGTCA at [23:29]
Match: CTGCAG at [19:25] and CTGCAG at [19:25]
It finds each match twice, once from each end. And there's a reverse palindrome in there, too!