python - increase speed of substring searching in a list

Question

I am quite new to Python and I am struggling to increase the speed of one piece of code.

I have a dictionary containing 500k DNA sequences. As a key, I have the identifier of the sequence, while as a value I have the corresponding DNA sequence. These sequences are of variable length (it is just a string containing CTACTA...) that could has 200 to 60k nucleotides. I need to remove DNA sequences that are substrings of larger sequences.

I wrote this:

def remove_subs():

    #Create a list of values based on reversed lenght
    LISTA=sorted(list(x for x in finaldic.values()), key=len, reverse=True)

    LISTA2=[]

    for a in range(len(LISTA)):
        #run the same list but in opposite direction 
        for b in range(len(sorted(LISTA,key=len))):
            if len(LISTA[b])<len(LISTA[a]):
                if LISTA[a].find(LISTA[b])!=-1 or Bio.Seq.reverse_complement(LISTA[a]).find(LISTA[b])!=-1 and LISTA[b]!=LISTA[a]:
                    LISTA2.append(LISTA[a])

I am trying to identify those substring sequences by running in two for loops, a list containing only the DNA sequences (ordered by length), in opposite directions using the built-in .find

This code works perfectly but takes ages to run such amount of information. I am quite sure that exists some faster option.

Can you help?

score 1 · Accepted Answer

From an algorithmic perspective, you likely should look at suffix trees. First, you build a generalized suffix tree from the strings you want to look in, which has an O(n) time complexity to construct (where n = number of characters in all strings to search through). Then, you can query that tree and as it if a substring is contained within it, which has a O(m) time complexity, where m is length of the substring. This, essentially, is as fast as it possibly could be.

Stack overflow question describing a few suffix tree libraries:

python: library for generalized suffix trees

Unfortunately, the examples here are not terribly mature codebases... There are C libraries which are significantly more focused on optimization and so on. Nonetheless, something like this suffix tree algorithm should be a simple drop-in replacement for your code:

import SubstringDict
d = SubstringDict.SubstringDict()
d['foobar'] = 1  
d['barfoo'] = 2
d['forget'] = 3
d['arfbag'] = 4

print(d['a'])
# [1, 2, 4]
print(d['arf'])
# [2, 4]
print (d['oo'])
# [1, 2]
print(d['food'])
# []

Searching and matching strings is a pretty large and active area in bioinformatics, and there is an entire body of literature on this problem.

score 0 · Accepted Answer

Just to clean this up so it's a bit more understandable:

def remove_subs():
    list_a = sorted(list(x for x in finaldic.values()), key=len, reverse=True)
    matches = []
    for first in list_a:
        for second in (sorted(list_a, key=len)):
            if first in second or first in Bio.Seq.reverse_complement(second):
                matches.append(first)
                break

You might see a speed-up just by using break.

This can be made smaller by using:

def remove_subs():
    list_a = sorted(list(x for x in finaldic.values()), key=len, reverse=True)
    matches = []
    for s in list_a:
        if any(substring in s for substring in list_a):
            matches.append(s)

Also, use this topic as a reference for algorithms.

score 0 · Accepted Answer

Here are some fixes that might increase your speed. At the least it will make your code more idiomatic to python.

def remove_subs():

    #Create a list of values based on reversed lenght
    list_a=sorted((x for x in finaldic.values()), key=len, reverse=True)

    list_a_2=[]

    for a in list_a:
        #run the same list but in opposite direction 
        for b in sorted(list_a,key=len):
            if len(b)<len(a):
                if b in a or b in Bio.Seq.reverse_complement(a) and b!=a:
                    list_a_2.append(a)

Two main changes: 1) Instead of using the .find method, I am using python's in operator to search. 2) Instead of indexing your lists, just loop over them directly.

You could probably get away with the if len(b) < len(a) condition, since b will never be in a if that is not true.

score 0 · Accepted Answer

I have an idea that could help, what about hashing the sequences? If the length of the smallest sequence is 200, then I would do a rolling hash (http://en.wikipedia.org/wiki/Rolling_hash) with window size being 200. I would then use the hash as a key into a dictionary, which would hold a list of sequence identifiers. Then if there is a list of size > 1, it's a candidate for substring (there can be collisions), and you can use find.

score 0 · Accepted Answer

Not having any test data or self-contained code, it's hard to test, but I'll point out that sorting inside a loop is rarely a good idea. This should bring the running time down to O(n^2) from O(n^3 * logn) :

def remove_subs():
    list_a_backward = sorted(list(x for x in finaldic.values()), key=len, reverse=True)
    list_a_forward = list_a_backward
    list_a_forward.reverse()

    matches = []
    for first in list_a_backward:
        for second in list_a_forward:
            if first in second or first in Bio.Seq.reverse_complement(second):
                matches.append(first)
                break

You could try Pypy too, since you appear to be running pure python. Failing that, numba or Cython might help.

python - increase speed of substring searching in a list

5 回答 5

Related

Reference