I am quite new to Python and I am struggling to increase the speed of one piece of code.
I have a dictionary containing 500k DNA sequences. The key is the identifier of the sequence and the value is the corresponding DNA sequence. The sequences are plain strings (CTACTA...) of variable length, from 200 to 60k nucleotides. I need to remove the DNA sequences that are substrings of larger sequences.
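To make the goal concrete, here is a minimal sketch on made-up toy data. The identifiers and sequences are hypothetical, not my real data, and `reverse_complement` and `is_contained` are illustrative helpers written in plain Python. A sequence should be dropped if it occurs in a strictly longer one, either directly or in its reverse complement:

```python
# Hypothetical toy data standing in for the real 500k-entry dictionary
finaldic = {
    "seq1": "ATGCCGTAAC",
    "seq2": "GCCG",    # occurs in seq1, so it should be removed
    "seq3": "TTACGG",  # occurs in the reverse complement of seq1, so it should be removed
    "seq4": "TTTTAA",  # not contained in any other sequence, so it should be kept
}

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Plain-Python reverse complement for A/C/G/T strings (illustrative helper)."""
    return seq.translate(_COMPLEMENT)[::-1]


def is_contained(short, long_seq):
    """True if `short` occurs in `long_seq` on either strand."""
    return short in long_seq or short in reverse_complement(long_seq)


# Keep only the sequences that are not contained in any strictly longer one
kept = {
    name: seq
    for name, seq in finaldic.items()
    if not any(
        len(seq) < len(other) and is_contained(seq, other)
        for other in finaldic.values()
    )
}

print(sorted(kept))  # ['seq1', 'seq4']
```

This is the result I am after, just on a much larger dictionary.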
I wrote this:
import Bio.Seq

def remove_subs():
    # Create a list of the sequences sorted by length, longest first
    LISTA = sorted(finaldic.values(), key=len, reverse=True)
    LISTA2 = []
    for a in range(len(LISTA)):
        # Run over the same list again to compare every pair of sequences
        for b in range(len(LISTA)):
            # Only a strictly shorter sequence can be a substring of LISTA[a]
            if len(LISTA[b]) < len(LISTA[a]):
                if LISTA[a].find(LISTA[b]) != -1 or Bio.Seq.reverse_complement(LISTA[a]).find(LISTA[b]) != -1:
                    # LISTA[b] is contained in LISTA[a], so record it for removal
                    LISTA2.append(LISTA[b])
    return LISTA2
I am trying to identify those substring sequences with two nested for loops over a list containing only the DNA sequences (sorted by length), testing each pair with the built-in .find on both the longer sequence and its reverse complement.
This code works perfectly but takes ages to run on this amount of data. I am quite sure that a faster option exists.
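To give a rough sense of the scale, here is a back-of-the-envelope sketch. It only counts how many comparisons the two loops imply for 500k sequences and assumes nothing about how long each one takes; the variable names are just for illustration:

```python
n_sequences = 500_000  # number of sequences in the dictionary

# The two nested loops visit every ordered pair (a, b)
loop_iterations = n_sequences ** 2

# At most one ordering of each distinct pair passes the length check,
# so this bounds the number of pairs that reach the substring search
max_searched_pairs = n_sequences * (n_sequences - 1) // 2

print(f"{loop_iterations:.2e} loop iterations")              # 2.50e+11 loop iterations
print(f"{max_searched_pairs:.2e} pairs searched at most")    # 1.25e+11 pairs searched at most
```

Each searched pair also costs a reverse complement and up to two .find calls on strings of up to 60k characters.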
Can you help?