16

我是 Python 新手,已经花了很多时间来解决这个问题,希望有人能帮助我。我需要找到两个序列之间的重叠。重叠在第一个序列的左端和第二个序列的右端。我希望函数找到重叠并返回它。

我的序列是:

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

我的函数应该命名为

def getOverlap(left, right)

作为s1左序列,作为右序列s2

结果应该是

‘GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC’

任何帮助表示赞赏

4

4 回答 4

22

Have a look at the difflib library and more precisely at find_longest_match():

import difflib

def get_overlap(s1, s2):
    s = difflib.SequenceMatcher(None, s1, s2)
    pos_a, pos_b, size = s.find_longest_match(0, len(s1), 0, len(s2)) 
    return s1[pos_a:pos_a+size]

s1 = "CGATTCCAGGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC"
s2 = "GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTCGTCCAGACCCCTAGC"

print(get_overlap(s1, s2)) # GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC
于 2013-01-02T20:46:53.867 回答
13

You could use difflib.SequenceMatcher:

d = difflib.SequenceMatcher(None,s1,s2)
>>> match = max(d.get_matching_blocks(),key=lambda x:x[2])
>>> match
Match(a=8, b=0, size=39)
>>> i,j,k = match
>>> d.a[i:i+k]
'GGCTCCCCACGGGGTACCCATAACTTGACAGTAGATCTC'
>>> d.a[i:i+k] == d.b[j:j+k]
True
>>> d.a == s1
True
>>> d.b == s2
True
于 2013-01-02T20:45:30.800 回答
6

Knuth-Morris-Pratt算法是一种在另一个字符串中查找一个字符串的好方法(因为我看到了 DNA,我猜你希望这个扩展到......数十亿?)。

# Knuth-Morris-Pratt string matching
# David Eppstein, UC Irvine, 1 Mar 2002

from __future__ import generators

def KnuthMorrisPratt(text, pattern):

    '''Yields all starting positions of copies of the pattern in the text.
Calling conventions are similar to string.find, but its arguments can be
lists or iterators, not just strings, it returns all matches, not just
the first one, and it does not need the whole text in memory at once.
Whenever it yields, it will have read the text exactly up to and including
the match that caused the yield.'''

    # allow indexing into pattern and protect against change during yield
    pattern = list(pattern)

    # build table of shift amounts
    shifts = [1] * (len(pattern) + 1)
    shift = 1
    for pos in range(len(pattern)):
        while shift <= pos and pattern[pos] != pattern[pos-shift]:
            shift += shifts[pos-shift]
        shifts[pos+1] = shift

    # do the actual search
    startPos = 0
    matchLen = 0
    for c in text:
        while matchLen == len(pattern) or \
              matchLen >= 0 and pattern[matchLen] != c:
            startPos += shifts[matchLen]
            matchLen -= shifts[matchLen]
        matchLen += 1
        if matchLen == len(pattern):
            yield startPos

我获得 KMP python 代码的链接(和一个内置函数,由于运行时常量,它对于小问题会更快)。

对于前沿性能,使用字符串的前缀表和散列窗口作为基数 4 整数(在生物学中,您将它们称为 k-mers 或 oligos)。; )

祝你好运!

编辑:还有一个不错的技巧,您可以对包含第一个字符串中的每个前缀(总共 n 个)和第二个字符串中的每个前缀(总共 n 个)的列表进行排序。如果它们共享最大的公共子序列,那么它们在排序列表中一定是相邻的,因此从排序列表中最接近的另一个字符串中找到元素,然后取完全匹配的最长前缀。:)

于 2013-01-02T20:43:49.637 回答
4

最长公共子串

def LongestCommonSubstring(S1, S2):
  M = [[0]*(1+len(S2)) for i in xrange(1+len(S1))]
  longest, x_longest = 0, 0
  for x in xrange(1,1+len(S1)):
    for y in xrange(1,1+len(S2)):
        if S1[x-1] == S2[y-1]:
            M[x][y] = M[x-1][y-1] + 1
            if M[x][y]>longest:
                longest = M[x][y]
                x_longest  = x
        else:
            M[x][y] = 0
  return S1[x_longest-longest: x_longest]
于 2013-01-02T20:38:42.447 回答